Figure 1: When two persons pass near each other, their identities can get confused.


1  Introduction 

Probability distributions over permutations arise in a diverse variety of
real-world problems. While they were perhaps first studied in the context of
gambling and card games, they have now been found to be applicable to many
important problems in multi-object tracking, information retrieval, webpage
ranking, preference elicitation, and voting.

As an example, consider the problem of tracking n persons based on a set of
noisy measurements of identity and position. A typical tracking system might
attempt to manage a set of n tracks along with an identity corresponding to
each track, in spite of ambiguities from imperfect identity measurements. When
the persons are well separated, the problem is easily decomposed and
measurements about each individual can be clearly associated with a particular
track. When persons pass near each other, however, confusion can arise as their
signal signatures may mix; see Figure 1. After the individuals separate again,
their positions may be clearly distinguishable, but their identities can still
be confused, resulting in identity uncertainty which must be propagated forward
in time with each person until additional observations allow for
disambiguation. This task of maintaining a belief state for the correct
association between object tracks and object identities, while accounting for
local mixing events and sensor observations, was introduced in (Shin et al.,
2003) and is called the identity management problem.

The identity management problem poses a challenge for probabilistic inference
because it needs to address the fundamental combinatorial challenge that
there is a factorial number of associations to maintain between tracks and
identities. Distributions over the space of all permutations require storing at
least $n! - 1$ numbers, an infeasible task for all but very small n. Moreover,
typical compact representations, such as graphical models, cannot efficiently
capture the mutual exclusivity constraints associated with permutations.

While there have been many approaches for coping with the factorial
complexity of maintaining a distribution over permutations, most attack the
problem using one of two ideas: storing and updating a small subset of likely
permutations, or, as in our case, restricting consideration to a tractable
subspace of possible distributions. (Willsky, 1978) was the first to formulate
the probabilistic filtering/smoothing problem for group-valued random
variables. He proposed an efficient FFT-based approach of transforming between
primal and Fourier domains so as to avoid costly convolutions, and provided
efficient algorithms for dihedral and metacyclic groups. (Kueh et al., 1999)
show that probability distributions on the group of permutations are well
approximated by a small subset of Fourier coefficients of the actual
distribution, allowing for a principled tradeoff between accuracy and
complexity. The approach taken in (Shin et al., 2005; Schumitsch et al., 2005;
Schumitsch et al., 2006) can be seen as an algorithm for maintaining a
particular fixed subset of Fourier coefficients of the log density. Most
recently, (Kondor et al., 2007) allow for a general set of Fourier
coefficients, but assume a restrictive form of the observation model in order
to exploit an efficient FFT factorization.

In this work,¹ we present several contributions which generalize and improve
upon the past related work. We present a new and simple algorithm, called
Kronecker Conditioning, which performs all probabilistic inference operations
completely in the Fourier domain, allowing for a principled tradeoff between
computational complexity and approximation accuracy. Our approach is fully
general, in the sense that it can address any transition model or likelihood
function that can be represented in the Fourier domain, such as those used in
previous work, and can represent the probability distribution using any desired
number of Fourier coefficients. We analyze the errors which can be introduced
by bandlimiting a probability distribution and show how they propagate with
respect to inference operations. Approximate conditioning based on bandlimited
distributions can sometimes yield Fourier coefficients which do not correspond
to any valid distribution, even returning negative “probabilities” on occasion.
We address this issue by presenting a method for projecting the result back
into the polytope of coefficients which correspond to nonnegative and
consistent marginal probabilities using an efficient quadratic program.
Finally, we empirically evaluate the accuracy of approximate inference on
simulated data drawn from our model and further demonstrate the effectiveness
of our approach on a real camera-based multi-person tracking scenario.

¹ A shorter version of this work appeared in (Huang et al., 2007). We provide a
more complete discussion of our Fourier-based methods in this extended paper.

Figure 2: Identity Management example. Three people, Alice, Bob and Cathy,
enter a room and we receive a position measurement for each person at each
time step. With no way to observe identities inside the room, however, we are
confused whenever two tracks get too close. In this example, Track 1 crosses
with Track 2, then with Track 3, then leaves the room, at which point it is
observed that the identity at Track 1 is in fact Bob.

2  Filtering  over  permutations 

As  a  prelude  to  the  general  problem  statement,  we  begin  with  a  simple  identity 
management  problem  on  three  tracks  (illustrated  in  Figure  2)  which  we  will  use 
as  a  running  example.  In  this  problem,  we  observe  a  stream  of  localization  data 
from  three  people  walking  inside  a  room.  Except  for  a  camera  positioned  at  the 
entrance,  however,  there  is  no  way  to  distinguish  between  identities  once  they 
are  inside.  In  this  example,  an  internal  tracker  declares  that  two  tracks  have 
‘mixed’  whenever  they  get  too  close  to  each  other  and  announces  the  identity 
of  any  track  that  enters  or  exits  the  room. 

In  our  particular  example,  three  people,  Alice,  Bob  and  Cathy,  enter  a  room 
separately,  walk  around,  and  we  observe  Bob  as  he  exits.  The  events  for  our 
particular  example  in  the  figure  are  recorded  in  Table  1.  Since  Tracks  2  and  3 
never  mix,  we  know  that  Cathy  cannot  be  in  Track  2  in  the  end,  and  furthermore, 
since  we  observe  Bob  to  be  in  Track  1  when  he  exits,  we  can  deduce  that  Cathy 
must  have  been  in  Track  3,  and  therefore  Alice  must  have  been  in  Track  2. 
Our  simple  example  illustrates  the  combinatorial  nature  of  the  problem  —  in 
particular,  reasoning  about  the  mixing  events  allows  us  to  exactly  decide  where 
Alice  and  Cathy  were  even  though  we  only  made  an  observation  about  Bob  at 
the  end. 


Event #   Event Type
1         Tracks 1 and 2 mixed
2         Tracks 1 and 3 mixed
3         Observed identity Bob at Track 1

Table  1:  Table  of  Mixing  and  Observation  events  logged  by  the  tracker. 

In identity management, a permutation $\sigma$ represents a joint assignment of
identities to internal tracks, with $\sigma(i)$ being the track belonging to
the $i$th identity. When people walk too closely together, their identities can
be confused, leading to uncertainty over $\sigma$. To model this uncertainty,
we use a Hidden Markov Model (HMM) on permutations, which is a joint
distribution over latent permutations $\sigma^{(1)}, \ldots, \sigma^{(T)}$ and
observed variables $z^{(1)}, \ldots, z^{(T)}$ which factors as:

\[
P(\sigma^{(1)}, \ldots, \sigma^{(T)}, z^{(1)}, \ldots, z^{(T)})
  = P(\sigma^{(1)}) \, P(z^{(1)} | \sigma^{(1)})
    \prod_{t=2}^{T} P(z^{(t)} | \sigma^{(t)}) \, P(\sigma^{(t)} | \sigma^{(t-1)}).
\]

The conditional probability distribution $P(\sigma^{(t)} | \sigma^{(t-1)})$ is
called the transition model, and might reflect, for example, that the
identities belonging to two tracks were swapped with some probability by a
mixing event. The distribution $P(z^{(t)} | \sigma^{(t)})$ is called the
observation model, which might, for example, capture a distribution over the
color of clothing for each individual.

We focus on filtering, in which one queries the HMM for the posterior
at some time step, conditioned on all past observations. Given the
distribution $P(\sigma^{(t)} | z^{(1)}, \ldots, z^{(t)})$, we recursively
compute $P(\sigma^{(t+1)} | z^{(1)}, \ldots, z^{(t+1)})$ in two steps: a
prediction/rollup step and a conditioning step. Taken together, these two
steps form the well known Forward Algorithm (Rabiner, 1989). The
prediction/rollup step multiplies the distribution by the transition model and
marginalizes out the previous time step:

\[
P(\sigma^{(t+1)} | z^{(1)}, \ldots, z^{(t)})
  = \sum_{\sigma^{(t)}} P(\sigma^{(t+1)} | \sigma^{(t)}) \,
    P(\sigma^{(t)} | z^{(1)}, \ldots, z^{(t)}).
\]

The conditioning step conditions the distribution on an observation
$z^{(t+1)}$ using Bayes rule:

\[
P(\sigma^{(t+1)} | z^{(1)}, \ldots, z^{(t+1)})
  \propto P(z^{(t+1)} | \sigma^{(t+1)}) \,
    P(\sigma^{(t+1)} | z^{(1)}, \ldots, z^{(t)}).
\]

Since there are $n!$ permutations, a single iteration of the algorithm requires
$O((n!)^2)$ flops and is consequently intractable for all but very small n. The
approach that we advocate is to maintain a compact approximation to the true
distribution based on the Fourier transform. As we discuss later, the
Fourier-based approximation is equivalent to maintaining a set of low-order
marginals, rather than the full joint, which we regard as being analogous to an
Assumed Density Filter (Boyen & Koller, 1998). Although we focus on HMMs and
filtering for concreteness, the approach we describe is useful for other
probabilistic inference tasks over permutations, such as ranking objects and
modeling user preferences.
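
To make the two filtering steps concrete, the following is a minimal sketch of
exact (non-Fourier) filtering on the three-track running example. Several
details are assumptions made only for illustration and are not specified in the
text: Alice, Bob and Cathy are taken to start on Tracks 1, 2 and 3
respectively, a mixing event is assumed to swap the two identities with
probability 1/2, and the identity observation is assumed to be noiseless. The
function names `mix` and `observe` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

IDENTITIES = ("Alice", "Bob", "Cathy")   # identity i is IDENTITIES[i - 1]
# A permutation sigma is a tuple with sigma[i - 1] = track of identity i.
PERMS = list(permutations((1, 2, 3)))

def mix(belief, a, b, p_swap=Fraction(1, 2)):
    """Prediction/rollup step for a mixing event between tracks a and b.

    Assumed transition model: with probability p_swap the identities on
    tracks a and b trade places, otherwise nothing changes.
    """
    swap = {a: b, b: a}
    new_belief = {sigma: Fraction(0) for sigma in PERMS}
    for sigma, prob in belief.items():
        swapped = tuple(swap.get(track, track) for track in sigma)
        new_belief[sigma] += (1 - p_swap) * prob
        new_belief[swapped] += p_swap * prob
    return new_belief

def observe(belief, identity, track):
    """Conditioning step for an assumed noiseless observation."""
    i = IDENTITIES.index(identity)
    unnormalized = {s: (p if s[i] == track else Fraction(0))
                    for s, p in belief.items()}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}

# Assumed initial state: identity i is on track i with certainty.
belief = {sigma: Fraction(int(sigma == (1, 2, 3))) for sigma in PERMS}
belief = mix(belief, 1, 2)          # Event 1: Tracks 1 and 2 mixed
belief = mix(belief, 1, 3)          # Event 2: Tracks 1 and 3 mixed
belief = observe(belief, "Bob", 1)  # Event 3: Bob observed at Track 1

posterior = {s: p for s, p in belief.items() if p > 0}
print(posterior)
```

Under these assumptions the posterior concentrates on the single assignment
with Alice at Track 2, Bob at Track 1 and Cathy at Track 3, which agrees with
the deduction made for the running example. The belief dictionary has one entry
per permutation, so this approach scales factorially in n.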

3  Probability Distributions over the Symmetric Group

A permutation on n elements is a one-to-one mapping of the set $\{1, \ldots, n\}$
into itself and can be written as a tuple,

\[ \sigma = [\sigma(1) \;\; \sigma(2) \;\; \ldots \;\; \sigma(n)], \]

where $\sigma(i)$ denotes where the $i$th element is mapped under the
permutation (called one line notation). For example, $\sigma = [2\;3\;1\;4\;5]$
means that $\sigma(1) = 2$, $\sigma(2) = 3$, $\sigma(3) = 1$, $\sigma(4) = 4$,
and $\sigma(5) = 5$. The set of all permutations on n elements forms a group
under the operation of function composition; that is, if $\sigma_1$ and
$\sigma_2$ are permutations, then

\[ \sigma_1 \sigma_2 = [\sigma_1(\sigma_2(1)) \;\; \sigma_1(\sigma_2(2)) \;\; \ldots \;\; \sigma_1(\sigma_2(n))] \]

is itself a permutation. The set of all $n!$ permutations is called the
Symmetric Group, or just $S_n$.

We will actually notate the elements of $S_n$ using the more standard cycle
notation, in which a cycle $(i, j, k, \ldots, \ell)$ refers to the permutation
which maps $i$ to $j$, $j$ to $k$, ..., and finally $\ell$ to $i$. Though not
every permutation can be written as a single cycle, any permutation can always
be written as a product of disjoint cycles. For example, the permutation
$\sigma = [2\;3\;1\;4\;5]$ written in cycle notation is
$\sigma = (1,2,3)(4)(5)$. The number of elements in a cycle is called the
cycle length and we typically drop the length 1 cycles in cycle notation when
it creates no ambiguity; in our example, $\sigma = (1,2,3)(4)(5) = (1,2,3)$.
We refer to the identity permutation (which maps every element to itself) as
$\epsilon$.
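
The two notations can be related with a short sketch. The helper names
`compose` and `to_cycles` are hypothetical, and permutations are stored as
Python tuples in one line notation (so `sigma[i - 1]` holds $\sigma(i)$), which
is an implementation choice rather than anything prescribed by the text.

```python
def compose(s1, s2):
    """Return the product s1 s2, i.e. the map i -> s1(s2(i))."""
    return tuple(s1[s2[i] - 1] for i in range(len(s2)))

def to_cycles(sigma):
    """Disjoint cycles of sigma, with length-1 cycles dropped."""
    seen, cycles = set(), []
    for start in range(1, len(sigma) + 1):
        if start in seen:
            continue
        cycle, i = [], start
        while i not in seen:          # follow start -> sigma(start) -> ...
            seen.add(i)
            cycle.append(i)
            i = sigma[i - 1]
        if len(cycle) > 1:
            cycles.append(tuple(cycle))
    return cycles

sigma = (2, 3, 1, 4, 5)               # one line notation [2 3 1 4 5]
print(to_cycles(sigma))               # cycle notation of sigma

swap12 = (2, 1, 3, 4, 5)              # the cycle (1,2)
swap13 = (3, 2, 1, 4, 5)              # the cycle (1,3)
product = compose(swap13, swap12)     # apply (1,2) first, then (1,3)
print(product == sigma)
```

Since composition applies the right-hand factor first, the product (1,3)(1,2)
sends 1 to 2, 2 to 3 and 3 to 1, which is the same permutation as (1,2,3).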

A probability distribution over permutations can be thought of as a joint
distribution on the n random variables $(\sigma(1), \ldots, \sigma(n))$ subject
to the mutual exclusivity constraints that
$P(\sigma : \sigma(i) = \sigma(j)) = 0$ whenever $i \neq j$. For example, in
the identity management problem, Alice and Bob cannot both be in Track 1
simultaneously. Due to the fact that all of the $\sigma(i)$ are coupled in
the joint distribution, graphical models, which might have otherwise exploited
an underlying conditional independence structure, are ineffective. Instead, our
Fourier-based approximation achieves compactness by exploiting the algebraic
structure of the problem.

3.1  Compact  summary  statistics 

While  continuous  distributions  like  Gaussians  are  typically  summarized  using 
moments  (like  mean  and  variance),  or  more  generally,  expected  features,  it  is 
not  immediately  obvious  how  one  might,  for  example,  compute  the  ‘mean’  of  a 
distribution  over  permutations.  There  is  a  simple  method  that  might  spring  to 
mind,  however,  which  is  to  think  of  the  permutations  as  permutation  matrices 
and  to  average  the  matrices  instead. 

Example 1. Consider the two permutations $\epsilon, (1,2) \in S_3$ ($\epsilon$
is the identity and $(1,2)$ swaps 1 and 2). We can associate the identity
permutation $\epsilon$ with the $3 \times 3$ identity matrix, and similarly, we
can associate the permutation $(1,2)$ with the matrix:

\[
(1,2) \mapsto
\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}.
\]

The ‘average’ of $\epsilon$ and $(1,2)$ is therefore:

\[
\frac{1}{2} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
+ \frac{1}{2} \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}
= \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 1/2 & 0 \\ 0 & 0 & 1 \end{bmatrix}.
\]
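
The averaging in Example 1 can be sketched in a few lines. The helper names
`perm_matrix` and `mean_matrix` are hypothetical, and the matrix convention
(a 1 in row $\sigma(j)$ of column $j$) is the one used for the first-order
permutation representation later in the paper.

```python
from fractions import Fraction

def perm_matrix(sigma):
    """Permutation matrix with a 1 in row sigma(j), column j."""
    n = len(sigma)
    return [[1 if sigma[j] == i + 1 else 0 for j in range(n)]
            for i in range(n)]

def mean_matrix(dist):
    """Probability-weighted average of permutation matrices."""
    n = len(next(iter(dist)))
    mean = [[Fraction(0)] * n for _ in range(n)]
    for sigma, prob in dist.items():
        m = perm_matrix(sigma)
        for i in range(n):
            for j in range(n):
                mean[i][j] += prob * m[i][j]
    return mean

half = Fraction(1, 2)
# Equal weight on the identity [1 2 3] and the swap (1,2) = [2 1 3].
avg = mean_matrix({(1, 2, 3): half, (2, 1, 3): half})
for row in avg:
    print([str(x) for x in row])
```

Every row and every column of the result sums to one, because each permutation
matrix has exactly one 1 per row and per column and the weights sum to one.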

As we will later show, computing the ‘mean’ (as described above) of a
distribution over permutations, P, compactly summarizes P by storing a marginal
distribution over each of $\sigma(1), \sigma(2), \ldots, \sigma(n)$, which
requires storing only $O(n^2)$ numbers rather than the full $O(n!)$ for the
exact distribution. As an example, one possible summary might look like:


           Alice   Bob   Cathy
Track 1    2/3     1/6   1/6
Track 2    1/3     1/3   1/3
Track 3    0       1/2   1/2

Such doubly stochastic “first-order summaries” have been studied in various
settings (Shin et al., 2003; Helmbold & Warmuth, 2007). In identity
management (Shin et al., 2003),² first-order summaries maintain, for example,

P(Alice is at Track 1) = 2/3,

P(Bob is at Track 3) = 1/2.

What cannot be captured by first-order summaries, however, are the higher order
statements like:

P(Alice is in Track 1 and Bob is in Track 2) = 0.

Over the next two sections, we will show that the first-order summary of a
distribution $P(\sigma)$ can equivalently be viewed as the lowest frequency
coefficients of the Fourier transform of $P(\sigma)$, and that by considering
higher frequencies, we can capture higher order marginal probabilities in a
principled fashion. Furthermore, the Fourier theoretic perspective, as we will
show, provides a natural framework for formulating inference operations with
respect to our compact summaries. In a nutshell, we will view the
prediction/rollup step as a convolution and the conditioning step as a
pointwise product; then we will formulate the two inference operations in the
Fourier domain as a pointwise product and convolution, respectively.

² Strictly speaking, a map from identities to tracks is not a permutation since
a permutation always maps a set into itself. In fact, the set of all such
identity-to-track assignments does not actually form a group since there is no
way to compose any two such assignments to obtain a legitimate group operation.
We abuse the notation by referring to these assignments as a group, but really
the elements of the group here should be thought of as the ‘deviation’ from
the original identity-to-track assignment (where only the tracks are permuted,
for example, when they are confused). In the group theoretic language, there is
a faithful group action of $S_n$ on the set of all identity-to-track
assignments.

4  The  Fourier  transform  on  finite  groups 

Over  the  last  fifty  years,  the  Fourier  Transform  has  been  ubiquitously  applied  to 
everything  digital,  particularly  with  the  invention  of  the  Fast  Fourier  Transform. 
On  the  real  line,  the  Fourier  Transform  is  a  well-studied  method  for  decomposing 
a  function  into  a  sum  of  sine  and  cosine  terms  over  a  spectrum  of  frequencies. 
Perhaps  less  familiar  though,  is  its  group  theoretic  generalization,  which  we 
review  in  this  section  with  an  eye  towards  approximating  functions  on  Sn.  For 
further  information,  see  (Diaconis,  1988)  and  (Terras,  1999). 

4.1  Group  representation  theory 

The  generalized  definition  of  the  Fourier  Transform  relies  on  the  theory  of  group 
representations,  which  formalize  the  concept  of  associating  permutations  with 
matrices  and  are  used  to  construct  a  complete  basis  for  the  space  of  functions 
on  a  group  G,  thus  also  playing  a  role  analogous  to  that  of  sinusoids  on  the  real 
line. 

Definition 2. A representation of a group G is a map $\rho$ from G to a set of
invertible $d_\rho \times d_\rho$ matrix operators which preserves algebraic
structure in the sense that for all $\sigma_1, \sigma_2 \in G$,
$\rho(\sigma_1 \sigma_2) = \rho(\sigma_1) \cdot \rho(\sigma_2)$. The matrices
which lie in the image of $\rho$ are called the representation matrices, and we
will refer to $d_\rho$ as the degree of the representation.

The requirement that $\rho(\sigma_1 \sigma_2) = \rho(\sigma_1) \cdot \rho(\sigma_2)$
is analogous to the property that
$e^{i(\theta_1 + \theta_2)} = e^{i\theta_1} \cdot e^{i\theta_2}$ for the
conventional sinusoidal basis. Each matrix entry $\rho_{ij}(\sigma)$ defines
some function over $S_n$:

\[
\rho(\sigma) =
\begin{bmatrix}
\rho_{11}(\sigma) & \rho_{12}(\sigma) & \cdots & \rho_{1 d_\rho}(\sigma) \\
\rho_{21}(\sigma) & \rho_{22}(\sigma) & \cdots & \rho_{2 d_\rho}(\sigma) \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{d_\rho 1}(\sigma) & \rho_{d_\rho 2}(\sigma) & \cdots & \rho_{d_\rho d_\rho}(\sigma)
\end{bmatrix}
\tag{4.1}
\]

and consequently, each representation $\rho$ simultaneously defines a set of
$d_\rho^2$ functions over $S_n$. We will eventually think of group
representations as the set of Fourier basis functions onto which we can project
arbitrary functions.

Example 3. We begin by showing three examples of representations on the
symmetric group.

1. The simplest example of a representation is called the trivial
   representation $\rho_{(n)} : S_n \to \mathbb{R}^{1 \times 1}$, which maps
   each element of the symmetric group to 1, the multiplicative identity on
   the real numbers. The trivial representation is actually defined for every
   group, and while it may seem unworthy of mention, it plays the role of the
   constant basis function in the Fourier theory.

2. The first-order permutation representation of $S_n$, which we alluded to in
   Example 1, is the degree n representation $\tau_{(n-1,1)}$ (we explain the
   terminology in Section 5), which maps a permutation $\sigma$ to its
   corresponding permutation matrix given by
   $[\tau_{(n-1,1)}(\sigma)]_{ij} = \mathbb{1}\{\sigma(j) = i\}$. For example,
   the first-order permutation representation on $S_3$ is given by:

   \[
   \tau_{(2,1)}(\epsilon) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
   \qquad
   \tau_{(2,1)}(1,2) = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}
   \]
   \[
   \tau_{(2,1)}(2,3) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}
   \qquad
   \tau_{(2,1)}(1,3) = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}
   \]
   \[
   \tau_{(2,1)}(1,2,3) = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
   \qquad
   \tau_{(2,1)}(1,3,2) = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}
   \]

3. The alternating representation of $S_n$ maps a permutation $\sigma$ to the
   determinant of $\tau_{(n-1,1)}(\sigma)$, which is $+1$ if $\sigma$ can be
   equivalently written as the composition of an even number of pairwise
   swaps, and $-1$ otherwise. We write the alternating representation as
   $\rho_{(1,\ldots,1)}$ with n 1's in the subscript. For example, on $S_4$,
   we have:

   \[ \rho_{(1,1,1,1)}((1,2,3)) = \rho_{(1,1,1,1)}((1,3)(1,2)) = +1. \]

   The alternating representation can be interpreted as the ‘highest
   frequency’ basis function on the symmetric group, intuitively due to its
   high sensitivity to swaps. For example, if $\rho_{(1,\ldots,1)}(\sigma) = 1$,
   then $\rho_{(1,\ldots,1)}((1,2)\sigma) = -1$.

   In identity management, it may be reasonable to believe that the joint
   probability over all n identity labels should only change by a little if
   just two objects are mislabeled due to swapping. In this case, ignoring the
   basis function corresponding to the alternating representation should still
   provide an accurate approximation to the joint distribution.
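
The defining property of a representation can be checked by brute force for the
examples above. The sketch below assumes permutations are stored as tuples in
one line notation, and computes the sign of a permutation by counting
inversions, which is one standard way to evaluate the alternating
representation; all helper names are hypothetical.

```python
from itertools import permutations

def compose(s1, s2):
    """Product s1 s2, i.e. the map i -> s1(s2(i))."""
    return tuple(s1[s2[i] - 1] for i in range(len(s2)))

def tau(sigma):
    """First-order permutation representation: entry (i, j) is 1 iff sigma(j) = i."""
    n = len(sigma)
    return [[1 if sigma[j] == i + 1 else 0 for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def sign(sigma):
    """Alternating representation: +1 for even permutations, -1 for odd."""
    n = len(sigma)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                     if sigma[i] > sigma[j])
    return -1 if inversions % 2 else 1

S3 = list(permutations((1, 2, 3)))
S4 = list(permutations((1, 2, 3, 4)))

# rho(s1 s2) = rho(s1) rho(s2) for every pair of group elements.
tau_is_representation = all(
    tau(compose(a, b)) == matmul(tau(a), tau(b)) for a in S3 for b in S3)
sign_is_representation = all(
    sign(compose(a, b)) == sign(a) * sign(b) for a in S4 for b in S4)

# Multiplying by a single swap always flips the sign.
swap12 = (2, 1, 3, 4)
swap_flips_sign = all(sign(compose(swap12, s)) == -sign(s) for s in S4)

print(tau_is_representation, sign_is_representation, swap_flips_sign)
```

The trivial representation satisfies the same property immediately, since
every product of ones is one.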

In general, a representation corresponds to an overcomplete set of functions
and therefore does not constitute a valid basis for any subspace of functions.
For example, the set of nine functions on $S_3$ corresponding to
$\tau_{(2,1)}$ span only four dimensions, because there are six normalization
constraints (three on the row sums and three on the column sums), of which five
are independent, and so there are five redundant dimensions. To find a valid
complete basis for the space of functions on $S_n$, we will need to find a
family of representations whose basis functions are independent, and span the
entire $n!$-dimensional space of functions.

In the following two definitions, we will provide two methods for constructing
a new representation from old ones such that the set of functions on $S_n$
corresponding to the new representation is linearly dependent on the old
representations. Somewhat surprisingly, it can be shown that dependencies which
arise amongst the representations can always be recognized, in a certain sense,
to come from the two possible following sources (Serre, 1977).

Definition 4.

1. Equivalence. Given a representation $\rho_1$ and an invertible matrix C, one
   can define a new representation $\rho_2$ by “changing the basis” for
   $\rho_1$:

   \[ \rho_2(\sigma) = C^{-1} \cdot \rho_1(\sigma) \cdot C. \tag{4.2} \]

   We say, in this case, that $\rho_1$ and $\rho_2$ are equivalent as
   representations (written $\rho_1 \equiv \rho_2$), and the matrix C is known
   as the intertwining operator. Note that $d_{\rho_1} = d_{\rho_2}$.

   It can be checked that the functions corresponding to $\rho_2$ can be
   reconstructed from those corresponding to $\rho_1$. For example, if C is a
   permutation matrix, the matrix entries of $\rho_2$ are exactly the same as
   the matrix entries of $\rho_1$, only permuted.

2. Direct Sum. Given two representations $\rho_1$ and $\rho_2$, we can always
   form a new representation, which we will write as $\rho_1 \oplus \rho_2$,
   by defining:

   \[
   \rho_1 \oplus \rho_2 (\sigma) \triangleq
   \begin{bmatrix} \rho_1(\sigma) & 0 \\ 0 & \rho_2(\sigma) \end{bmatrix}.
   \]

   $\rho_1 \oplus \rho_2$ is called the direct sum representation. For
   example, the direct sum of two copies of the trivial representation is:

   \[
   \rho_{(n)} \oplus \rho_{(n)} (\sigma) =
   \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix},
   \]

   with four corresponding functions on $S_n$, each of which is clearly
   dependent upon the trivial representation itself.
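
As a small illustration of the direct sum construction, the sketch below builds
the block-diagonal matrix from two representation matrices stored as nested
lists. The helper names `direct_sum` and `trivial` are hypothetical.

```python
def direct_sum(a, b):
    """Block-diagonal matrix with a in the top left and b in the bottom right."""
    top = [list(row) + [0] * len(b[0]) for row in a]
    bottom = [[0] * len(a[0]) + list(row) for row in b]
    return top + bottom

def trivial(sigma):
    """Trivial representation: every group element maps to the 1 x 1 matrix [1]."""
    return [[1]]

sigma = (2, 3, 1)                      # any permutation gives the same result
doubled = direct_sum(trivial(sigma), trivial(sigma))
print(doubled)

# Blocks of different sizes are padded with zeros on the off-diagonal.
mixed = direct_sum([[1]], [[0, 1], [1, 0]])
print(mixed)
```

The four entries of `doubled` are the constant functions 1 and 0, so they carry
no information beyond the trivial representation itself.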

Most representations can be seen as being equivalent to a direct sum of
strictly smaller representations. Whenever a representation $\rho$ can be
decomposed as $\rho \equiv \rho_1 \oplus \rho_2$, we say that $\rho$ is
reducible. As an example, we now show that the first-order permutation
representation is a reducible representation.

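Example 5. Instead of using the standard basis vectors $\{e_1, e_2, e_3\}$, the
first-order permutation representation $\tau_{(2,1)}$ can be equivalently
written with respect to a new basis $\{v_1, v_2, v_3\}$, where:

\[
v_1 = \frac{e_1 + e_2 + e_3}{|e_1 + e_2 + e_3|}, \qquad
v_2 = \frac{-e_1 + e_2}{|-e_1 + e_2|}, \qquad
v_3 = \frac{-e_1 - e_2 + 2 e_3}{|-e_1 - e_2 + 2 e_3|}.
\]

To ‘change the basis’, we write the new basis vectors as columns in a matrix C:

\[
C = \begin{bmatrix} | & | & | \\ v_1 & v_2 & v_3 \\ | & | & | \end{bmatrix}
  = \begin{bmatrix}
      \frac{1}{\sqrt{3}} & -\frac{\sqrt{2}}{2} & -\frac{1}{\sqrt{6}} \\
      \frac{1}{\sqrt{3}} & \frac{\sqrt{2}}{2} & -\frac{1}{\sqrt{6}} \\
      \frac{1}{\sqrt{3}} & 0 & \frac{2}{\sqrt{6}}
    \end{bmatrix},
\]

and conjugate the representation $\tau_{(2,1)}$ by C (as in Equation 4.2) to
obtain the equivalent representation $C^{-1} \cdot \tau_{(2,1)}(\sigma) \cdot C$:

\[
C^{-1} \tau_{(2,1)}(\epsilon) C =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\qquad
C^{-1} \tau_{(2,1)}(1,2) C =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]
\[
C^{-1} \tau_{(2,1)}(2,3) C =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{\sqrt{3}}{2} \\ 0 & \frac{\sqrt{3}}{2} & -\frac{1}{2} \end{bmatrix}
\qquad
C^{-1} \tau_{(2,1)}(1,3) C =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & -\frac{\sqrt{3}}{2} \\ 0 & -\frac{\sqrt{3}}{2} & -\frac{1}{2} \end{bmatrix}
\]
\[
C^{-1} \tau_{(2,1)}(1,2,3) C =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & -\frac{1}{2} & -\frac{\sqrt{3}}{2} \\ 0 & \frac{\sqrt{3}}{2} & -\frac{1}{2} \end{bmatrix}
\qquad
C^{-1} \tau_{(2,1)}(1,3,2) C =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & -\frac{1}{2} & \frac{\sqrt{3}}{2} \\ 0 & -\frac{\sqrt{3}}{2} & -\frac{1}{2} \end{bmatrix}
\]

The interesting property of this particular basis is that the new
representation matrices all appear to be the direct sum of two smaller
representations, a trivial representation $\rho_{(3)}$ as the top left block,
and a degree 2 representation in the bottom right which we will refer to as
$\rho_{(2,1)}$.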

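Geometrically, the representation $\rho_{(2,1)}$ can also be thought of as the
group of rigid symmetries of the equilateral triangle with vertices:

\[
P_1 = \begin{bmatrix} \sqrt{3}/2 \\ 1/2 \end{bmatrix}, \qquad
P_2 = \begin{bmatrix} -\sqrt{3}/2 \\ 1/2 \end{bmatrix}, \qquad
P_3 = \begin{bmatrix} 0 \\ -1 \end{bmatrix}.
\]

The matrix $\rho_{(2,1)}(1,2)$ acts on the triangle by reflecting about the
y-axis (negating the x-coordinate, which exchanges $P_1$ and $P_2$), and
$\rho_{(2,1)}(1,2,3)$ by a $2\pi/3$ counter-clockwise rotation.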
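
The change of basis in Example 5 can be verified numerically. The sketch below
uses floating-point arithmetic and relies on the fact that C has orthonormal
columns, so that its inverse equals its transpose; the helper names are
hypothetical.

```python
import math
from itertools import permutations

def tau(sigma):
    """First-order permutation representation: entry (i, j) is 1 iff sigma(j) = i."""
    n = len(sigma)
    return [[1.0 if sigma[j] == i + 1 else 0.0 for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

s2, s3, s6 = math.sqrt(2), math.sqrt(3), math.sqrt(6)
# Columns are the basis vectors v1, v2, v3 of Example 5.
C = [[1 / s3, -1 / s2, -1 / s6],
     [1 / s3,  1 / s2, -1 / s6],
     [1 / s3,  0.0,     2 / s6]]
C_INV = transpose(C)   # valid because the columns of C are orthonormal

def conjugate(sigma):
    """C^{-1} tau(sigma) C."""
    return matmul(matmul(C_INV, tau(sigma)), C)

S3 = list(permutations((1, 2, 3)))
conjugated = {sigma: conjugate(sigma) for sigma in S3}

def is_block_diagonal(m, tol=1e-12):
    """True if m is a 1 x 1 block equal to 1 followed by a 2 x 2 block."""
    off_block = [m[0][1], m[0][2], m[1][0], m[2][0]]
    return abs(m[0][0] - 1.0) < tol and all(abs(x) < tol for x in off_block)

print(all(is_block_diagonal(m) for m in conjugated.values()))

# Lower-right 2 x 2 block for the 3-cycle (1,2,3), i.e. [2 3 1].
rotation = [row[1:] for row in conjugated[(2, 3, 1)][1:]]
print(rotation)
```

The 2 x 2 block printed for the 3-cycle has cosine and sine entries of the
angle $2\pi/3$, in line with the geometric description of the triangle
symmetries.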

In general, there are infinitely many reducible representations. For example,
given any dimension d, there is a representation which maps every element of a
group G to the $d \times d$ identity matrix (the direct sum of d copies of the
trivial representation). However, for any finite group, there exists a finite
collection of atomic representations which can be used to build up any other
representation using the direct sum operation. These representations are
referred to as the irreducibles of a group, and they are defined simply to be
the collection of representations (up to equivalence) which are not reducible.
It can be shown that any representation of a finite group G is equivalent to a
direct sum of irreducibles (Diaconis, 1988), and hence, for any representation
$\tau$, there exists a matrix C for which

\[
C^{-1} \cdot \tau \cdot C = \bigoplus_{\rho} \bigoplus_{j=1}^{z_\rho} \rho,
\]

where $\rho$ ranges over all distinct irreducible representations of the group
G, and the inner $\oplus$ refers to some finite number ($z_\rho$) of copies of
each irreducible $\rho$.

As it happens, there are only three irreducible representations of $S_3$
(Diaconis, 1988): the trivial representation $\rho_{(3)}$, the degree 2
representation $\rho_{(2,1)}$, and the alternating representation
$\rho_{(1,1,1)}$. The complete set of irreducible representation matrices of
$S_3$ is shown in Table 2. Unfortunately, the analysis of the irreducible
representations for $n > 3$ is far more complicated and we postpone this more
general discussion to Section 5.

4.2  The  Fourier  transform 

The link between group representation theory and Fourier analysis is given by
the celebrated Peter-Weyl theorem (Diaconis, 1988; Terras, 1999; Sagan, 2001),
which says that the matrix entries of the irreducibles of G form a complete set
of orthogonal basis functions on G.³ The space of functions on $S_3$, for
example, is orthogonally spanned by the 3! functions $\rho_{(3)}(\sigma)$,
$[\rho_{(2,1)}(\sigma)]_{1,1}$, $[\rho_{(2,1)}(\sigma)]_{1,2}$,
$[\rho_{(2,1)}(\sigma)]_{2,1}$, $[\rho_{(2,1)}(\sigma)]_{2,2}$ and
$\rho_{(1,1,1)}(\sigma)$, where $[\rho(\sigma)]_{ij}$ denotes the $(i,j)$ entry
of the matrix $\rho(\sigma)$.

As a replacement for projecting a function $f$ onto a complete set of
sinusoidal basis functions (as one would do on the real line), the Peter-Weyl
theorem suggests instead to project onto the basis provided by the irreducibles
of G. As on the real line, this projection can be done by computing the inner
product of $f$ with each element of the basis, and we define this operation to
be the generalized form of the Fourier Transform.

Definition 6. Let $f : G \to \mathbb{R}$ be any function on a group G and let
$\rho$ be any representation on G. The Fourier Transform of $f$ at the
representation $\rho$ is defined to be the matrix of coefficients:

\[ \hat{f}_\rho = \sum_{\sigma} f(\sigma) \rho(\sigma). \tag{4.3} \]

The collection of Fourier Transforms at all irreducible representations of G
forms the Fourier Transform of $f$.

There are two important points which distinguish this Fourier Transform
from its familiar formulation on the real line: first, the outputs of the
transform are matrix-valued, and second, the inputs to $\hat{f}$ are
representations of G rather than real numbers. As in the familiar formulation,
the Fourier Transform is invertible and the inversion formula is explicitly
given by the Fourier Inversion Theorem.


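Theorem 7 (Fourier Inversion Theorem).

\[
f(\sigma) = \frac{1}{|G|} \sum_{\lambda} d_{\rho_\lambda}
  \mathrm{Tr}\left[ \hat{f}_{\rho_\lambda}^T \cdot \rho_\lambda(\sigma) \right],
\tag{4.4}
\]

where $\lambda$ indexes over the collection of irreducibles of G.

Note that the trace term in the inverse Fourier Transform is just the ‘matrix
dot product’ between $\hat{f}_{\rho_\lambda}$ and $\rho_\lambda(\sigma)$, since
$\mathrm{Tr}[A^T \cdot B] = \langle \mathrm{vec}(A), \mathrm{vec}(B) \rangle$,
where by vec we mean mapping a matrix to a vector on the same elements
arranged in column-major order.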
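
The transform and its inverse can be sketched on $S_3$ as a round trip. This is
only an illustration under stated assumptions: the degree 2 irreducible is
obtained by conjugating the permutation representation with the orthonormal
basis of Example 5 and keeping the lower-right block, the alternating
representation is computed by counting inversions, and the test function `f` is
an arbitrary choice. All helper names are hypothetical.

```python
import math
from itertools import permutations

S3 = list(permutations((1, 2, 3)))    # one line notation

def tau(sigma):
    n = len(sigma)
    return [[1.0 if sigma[j] == i + 1 else 0.0 for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def sign(sigma):
    n = len(sigma)
    inv = sum(1 for i in range(n) for j in range(i + 1, n) if sigma[i] > sigma[j])
    return -1.0 if inv % 2 else 1.0

s2, s3, s6 = math.sqrt(2), math.sqrt(3), math.sqrt(6)
C = [[1 / s3, -1 / s2, -1 / s6], [1 / s3, 1 / s2, -1 / s6], [1 / s3, 0.0, 2 / s6]]

def rho_21(sigma):
    """Degree 2 irreducible: lower-right block of C^T tau(sigma) C."""
    m = matmul(matmul(transpose(C), tau(sigma)), C)
    return [row[1:] for row in m[1:]]

# Trivial, degree 2, and alternating representations of S3.
IRREPS = [lambda s: [[1.0]], rho_21, lambda s: [[sign(s)]]]

def fourier_transform(f):
    """Equation 4.3 at each irreducible: sum over sigma of f(sigma) rho(sigma)."""
    coeffs = []
    for rho in IRREPS:
        d = len(rho(S3[0]))
        total = [[0.0] * d for _ in range(d)]
        for sigma in S3:
            r = rho(sigma)
            for i in range(d):
                for j in range(d):
                    total[i][j] += f[sigma] * r[i][j]
        coeffs.append(total)
    return coeffs

def inverse_transform(coeffs):
    """Equation 4.4, using Tr[A^T B] = sum of entrywise products."""
    f = {}
    for sigma in S3:
        value = 0.0
        for rho, f_hat_rho in zip(IRREPS, coeffs):
            r = rho(sigma)
            d = len(r)
            trace = sum(f_hat_rho[i][j] * r[i][j]
                        for i in range(d) for j in range(d))
            value += d * trace
        f[sigma] = value / len(S3)
    return f

f = {(1, 2, 3): 1 / 3, (2, 1, 3): 1 / 6, (1, 3, 2): 1 / 3,
     (3, 2, 1): 0.0, (2, 3, 1): 1 / 6, (3, 1, 2): 0.0}
f_hat = fourier_transform(f)
recovered = inverse_transform(f_hat)
print(max(abs(recovered[s] - f[s]) for s in S3))
```

The six numbers stored in `f_hat` (one, four and one entries for the three
irreducibles) are enough to recover all six values of `f` up to floating-point
error, and the coefficient at the trivial representation equals the sum of
`f`.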

We now provide several examples for intuition. For functions on the real line,
the Fourier Transform at zero frequency gives the DC component of a signal.
The same holds true for functions on a group; if $f : G \to \mathbb{R}$ is any
function, then since $\rho_{(n)} = 1$, the Fourier Transform of $f$ at the
trivial representation is constant, with
$\hat{f}_{\rho_{(n)}} = \sum_{\sigma} f(\sigma)$. Thus, for any probability
distribution P, we have $\hat{P}_{\rho_{(n)}} = 1$. If P were the uniform
distribution, then $\hat{P}_\rho = 0$ at every irreducible $\rho$ except at the
trivial representation.

³ Technically the Peter-Weyl result, as stated here, is only true if all of the
representation matrices are unitary. That is, $\rho(\sigma)^* \rho(\sigma) = I$
for all $\sigma \in S_n$, where the matrix $A^*$ is the conjugate transpose of
A. For the case of real-valued (as opposed to complex-valued) matrices,
however, the definitions of unitary and orthogonal matrices coincide.
While most representations are not unitary, there is a standard result from
representation theory which shows that for any representation of G, there
exists an equivalent unitary representation.

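The Fourier Transform at $\tau_{(n-1,1)}$ also has a simple interpretation:

$$[\hat{f}_{\tau_{(n-1,1)}}]_{ij} = \sum_{\sigma \in S_n} f(\sigma)\,[\tau_{(n-1,1)}(\sigma)]_{ij} = \sum_{\sigma \in S_n} f(\sigma)\,\mathbb{1}\{\sigma(j) = i\} = \sum_{\sigma : \sigma(j) = i} f(\sigma).$$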

The set $\Delta_{ij} = \{\sigma : \sigma(j) = i\}$ is the set of the (n - 1)!
possible permutations which map element j to i. In identity management,
$\Delta_{ij}$ can be thought of as the set of assignments which, for example,
have Alice at Track 1. If P is a distribution, then $\hat{P}_{\tau_{(n-1,1)}}$ is a
matrix of first-order marginal probabilities, where the (i, j)-th element is
the marginal probability that a random permutation drawn from P maps
element j to i.

Example 8. Consider the following probability distribution on $S_3$:

    σ       e      (1,2)   (2,3)   (1,3)   (1,2,3)   (1,3,2)
    P(σ)    1/3    1/6     1/3     0       1/6       0

The set of all first order marginal probabilities is given by the Fourier
transform at $\tau_{(2,1)}$:

          A      B      C
    1     2/3    1/6    1/6
    2     1/3    1/3    1/3
    3     0      1/2    1/2

In the above matrix, each column j represents a marginal distribution over the
possible tracks that identity j can map to under a random draw from P. We
see, for example, that Alice is at Track 1 with probability 2/3, or at Track 2
with probability 1/3. Simultaneously, each row i represents a marginal
distribution over the possible identities that could have been mapped to track i
under a random draw from P. In our example, Bob and Cathy are equally likely to
be in Track 3, but Alice is definitely not in Track 3. Since each row and each
column is itself a distribution, the matrix $\hat{P}_{\tau_{(2,1)}}$ must be doubly
stochastic. We will elaborate on the consequences of this observation later.
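The first-order marginals of Example 8 can be reproduced with a few lines of
code. The sketch below is only an illustration: it assumes permutations are
stored in one-line notation (the tuple entry at position j is the track that
identity j maps to), and the helper name `first_order_marginals` is
hypothetical.

```python
from fractions import Fraction as F

# One-line notation: sigma[j - 1] is the track that identity j maps to.
# Keys are e, (1,2), (2,3), (1,3), (1,2,3), (1,3,2), in the order of Example 8.
P = {
    (1, 2, 3): F(1, 3),
    (2, 1, 3): F(1, 6),
    (1, 3, 2): F(1, 3),
    (3, 2, 1): F(0),
    (2, 3, 1): F(1, 6),
    (3, 1, 2): F(0),
}

def first_order_marginals(dist, n):
    """Return M with M[i][j] = P(sigma(j + 1) = i + 1)."""
    M = [[F(0)] * n for _ in range(n)]
    for sigma, prob in dist.items():
        for j in range(n):
            M[sigma[j] - 1][j] += prob
    return M

M = first_order_marginals(P, 3)
for row in M:
    print([str(x) for x in row])
```

Each row and each column of `M` sums to one, which is the doubly stochastic
property noted above.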

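The Fourier transform of the same distribution at all irreducibles is:

$$\hat{P}_{\rho_{(3)}} = 1, \qquad \hat{P}_{\rho_{(2,1)}} = \begin{bmatrix} 1/4 & \sqrt{3}/4 \\ \sqrt{3}/4 & 1/4 \end{bmatrix}, \qquad \hat{P}_{\rho_{(1,1,1)}} = 0.$$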
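As a sanity check on the inversion formula (4.4), the following sketch computes
the Fourier transform of the Example 8 distribution at the three irreducibles
of $S_3$ and then recovers the distribution from the coefficients. The
generator matrices chosen for the two-dimensional irreducible are an assumption
(one orthogonal basis among many equivalent ones), so the entries of the
two-dimensional coefficient depend on that choice; the recovered probabilities
and the one-dimensional coefficients do not. All helper names are hypothetical.

```python
import math

def compose(p, q):
    """Return 'p after q' in one-line notation: result(j) = p(q(j))."""
    return tuple(p[q[j] - 1] for j in range(len(q)))

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def build_rep(generators, n, dim):
    """Extend generator matrices to the whole group via rho(s g) = rho(s) rho(g)."""
    identity = tuple(range(1, n + 1))
    rep = {identity: [[float(i == j) for j in range(dim)] for i in range(dim)]}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s, matrix in generators.items():
            h = compose(s, g)
            if h not in rep:
                rep[h] = matmul(matrix, rep[g])
                frontier.append(h)
    return rep

s12, s23 = (2, 1, 3), (1, 3, 2)          # adjacent transpositions (1,2), (2,3)
h = math.sqrt(3) / 2
irreps = {
    "(3)": build_rep({s12: [[1.0]], s23: [[1.0]]}, 3, 1),
    # Assumed orthogonal basis for the two-dimensional irreducible.
    "(2,1)": build_rep({s12: [[-1.0, 0.0], [0.0, 1.0]],
                        s23: [[0.5, h], [h, -0.5]]}, 3, 2),
    "(1,1,1)": build_rep({s12: [[-1.0]], s23: [[-1.0]]}, 3, 1),
}

P = {(1, 2, 3): 1 / 3, (2, 1, 3): 1 / 6, (1, 3, 2): 1 / 3,
     (3, 2, 1): 0.0, (2, 3, 1): 1 / 6, (3, 1, 2): 0.0}

def fourier(f, rep):
    """Equation 4.3: sum of f(sigma) * rho(sigma) over the group."""
    d = len(next(iter(rep.values())))
    out = [[0.0] * d for _ in range(d)]
    for sigma, value in f.items():
        for i in range(d):
            for j in range(d):
                out[i][j] += value * rep[sigma][i][j]
    return out

def inverse_fourier(coeffs, reps, sigma, order):
    """Equation 4.4; Tr[A^T B] is the sum of elementwise products of A and B."""
    total = 0.0
    for name, rep in reps.items():
        d = len(coeffs[name])
        trace = sum(coeffs[name][i][j] * rep[sigma][i][j]
                    for i in range(d) for j in range(d))
        total += d * trace
    return total / order

coeffs = {name: fourier(P, rep) for name, rep in irreps.items()}
recovered = {sigma: inverse_fourier(coeffs, irreps, sigma, len(P)) for sigma in P}
```

The coefficient at the trivial representation comes out as 1 and the one at
the alternating representation as 0, in agreement with the values stated for
$\rho_{(3)}$ and $\rho_{(1,1,1)}$.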


The first-order permutation representation, $\tau_{(n-1,1)}$, captures the
statistics of how a random permutation acts on a single object irrespective of
where all of the other n - 1 objects are mapped, and in doing so, compactly
summarizes the distribution with only $O(n^2)$ numbers. Unfortunately, as
mentioned in Section 3, the Fourier transform at the first-order permutation
representation cannot capture more complicated statements like:

P(Alice and Bob occupy Tracks 1 and 2) = 0.

To avoid collapsing away so much information, we might define richer summary
statistics that capture 'higher-order' effects. We define the second-order
unordered permutation representation by:

$$[\tau_{(n-2,2)}(\sigma)]_{\{i,j\},\{k,\ell\}} = \mathbb{1}\{\sigma(\{k,\ell\}) = \{i,j\}\},$$

where we index the matrix rows and columns by unordered pairs $\{i,j\}$. The
condition inside the indicator function states that the representation captures
whether the pair of objects $\{k,\ell\}$ maps to the pair $\{i,j\}$, but is
indifferent with respect to the ordering; i.e., either $k \mapsto i$ and
$\ell \mapsto j$, or $k \mapsto j$ and $\ell \mapsto i$.

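Example 9. For n = 4, there are six possible unordered pairs: {1,2}, {1,3},
{1,4}, {2,3}, {2,4}, and {3,4}. With rows and columns indexed by the pairs in
this order, the matrix representation of the permutation (1,2,3) is:

$$\tau_{(2,2)}(1,2,3) = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0
\end{bmatrix}$$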
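The matrix for Example 9 follows mechanically from the definition of the
second-order unordered representation. The sketch below builds it for any
permutation in one-line notation and also checks the homomorphism property on
one pair of permutations; the function names are hypothetical and the pair
ordering (lexicographic) is an assumption.

```python
from itertools import combinations

def unordered_pair_rep(sigma):
    """Matrix of tau_(n-2,2)(sigma); sigma[k - 1] is the image of k."""
    n = len(sigma)
    pairs = [frozenset(c) for c in combinations(range(1, n + 1), 2)]
    index = {pair: r for r, pair in enumerate(pairs)}
    M = [[0] * len(pairs) for _ in pairs]
    for col, pair in enumerate(pairs):
        image = frozenset(sigma[k - 1] for k in pair)
        M[index[image]][col] = 1      # column {k,l} has its 1 in row sigma({k,l})
    return M

def compose(p, q):
    """Return 'p after q' in one-line notation."""
    return tuple(p[q[j] - 1] for j in range(len(q)))

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

cycle_123 = (2, 3, 1, 4)              # the cycle (1,2,3): 1->2, 2->3, 3->1, 4 fixed
T = unordered_pair_rep(cycle_123)
for row in T:
    print(row)

swap_34 = (1, 2, 4, 3)
product_ok = (unordered_pair_rep(compose(cycle_123, swap_34))
              == matmul(T, unordered_pair_rep(swap_34)))
```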


The second order ordered permutation representation, $\tau_{(n-2,1,1)}$, is
defined similarly:

$$[\tau_{(n-2,1,1)}(\sigma)]_{(i,j),(k,\ell)} = \mathbb{1}\{\sigma((k,\ell)) = (i,j)\},$$

where $(k,\ell)$ denotes an ordered pair. Therefore,
$[\tau_{(n-2,1,1)}(\sigma)]_{(i,j),(k,\ell)}$ is 1 if and only if $\sigma$ maps k
to i and $\ell$ to j.

As in the first-order case, the Fourier transform of a probability distribution
at $\tau_{(n-2,2)}$ returns a matrix of marginal probabilities of the form
$P(\sigma : \sigma(\{k,\ell\}) = \{i,j\})$, which captures statements like, "Alice
and Bob occupy Tracks 1 and 2 with probability 1/2". Similarly, the Fourier
transform at $\tau_{(n-2,1,1)}$ returns a matrix of marginal probabilities of the
form $P(\sigma : \sigma((k,\ell)) = (i,j))$, which captures statements like,
"Alice is in Track 1 and Bob is in Track 2 with probability 9/10".

We can go further and define third-order representations, fourth-order
representations, and so on. In general, however, the permutation representations
as they have been defined above are reducible, intuitively due to the fact that
it is possible to recover lower order marginal probabilities from higher order
marginal probabilities. For example, one can recover the normalization
constant (corresponding to the trivial representation) from the first order
matrix of marginals by summing across either the rows or columns, and the first
order marginal probabilities from the second order marginal probabilities by
summing across appropriate matrix entries. To truly leverage the machinery of
Fourier analysis, it is important to understand the Fourier transform at the
irreducibles of the symmetric group, and in the next section, we show how to
derive the irreducible representations of the Symmetric group by first defining
permutation representations, then "subtracting off the lower-order effects".


5  Representation theory on the Symmetric group

In this section, we provide a brief introduction to the representation theory
of the Symmetric group. Rather than giving a fully rigorous treatment of the
subject, our goal is to give some intuition about the kind of information which
can be captured by the irreducible representations of $S_n$. Roughly speaking,
we will show that Fourier transforms on the Symmetric group, instead of being
indexed by frequencies, are indexed by partitions of n (tuples of numbers which
sum to n), and certain partitions correspond to more complex basis functions
than others. For proofs, we refer the reader to (Diaconis, 1989; James &
Kerber, 1981; Sagan, 2001; Vershik & Okounkov, 2006).

Instead of the singleton or pairwise marginals which were described in the
previous section, we will now focus on using the Fourier coefficients of a
distribution to query a much wider class of marginal probabilities. As an
example, we will be able to compute the following (more complicated) marginal
probability on $S_6$ using Fourier coefficients:

$$P\left(\sigma : \sigma\left(\begin{array}{ccc} 1 & 2 & 3 \\ 4 & 5 & \\ 6 & & \end{array}\right) = \begin{array}{ccc} 1 & 2 & 6 \\ 4 & 5 & \\ 3 & & \end{array}\right), \qquad (5.1)$$

which we interpret as the joint marginal probability that the rows of the diagram
on the left map to corresponding rows on the right as unordered sets. In other
words, Equation 5.1 is the joint probability that the unordered set {1,2,3} maps
to {1,2,6}, the unordered pair {4,5} maps to {4,5}, and the singleton {6} maps
to {3}.

The diagrams in Equation 5.1 are known as Ferrers diagrams and are commonly
used to visualize partitions of n, which are defined to be unordered tuples
of positive integers, $\lambda = (\lambda_1, \ldots, \lambda_\ell)$, which sum to n.
For example, $\lambda = (3,2)$ is a partition of n = 5 since 3 + 2 = 5. Usually we
write partitions as weakly decreasing sequences by convention, so the
partitions of n = 5 are:

(5), (4,1), (3,2), (3,1,1), (2,2,1), (2,1,1,1), (1,1,1,1,1),

and their respective Ferrers diagrams are:

    □□□□□   □□□□   □□□   □□□   □□   □□   □
            □      □□    □     □□   □    □
                         □     □    □    □
                                    □    □
                                         □
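The partitions listed above, together with text versions of their Ferrers
diagrams, can be generated recursively. This is a minimal sketch; the function
names and the box character are arbitrary choices made for illustration.

```python
def partitions(n, max_part=None):
    """Yield the partitions of n as weakly decreasing tuples."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def ferrers(partition, box="#"):
    """Return a text Ferrers diagram with one row of boxes per part."""
    return "\n".join(box * part for part in partition)

for lam in partitions(5):
    print(lam)
    print(ferrers(lam))
    print()
```

Calling `partitions(6)` in the same way enumerates the eleven partitions that
index the irreducibles of $S_6$.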


A Young tabloid is an assignment of the numbers {1,...,n} to the boxes of
a Ferrers diagram for a partition $\lambda$, where each row represents an
unordered set. There are 6 Young tabloids corresponding to the partition
$\lambda = (2,2)$, for example:

    1 2     1 3     1 4     2 3     2 4     3 4
    3 4     2 4     2 3     1 4     1 3     1 2

The Young tabloid with rows 1 2 and 3 4, for example, represents the two
unordered sets {1,2} and {3,4}, and if we were interested in computing the joint
probability that $\sigma(\{1,2\}) = \{3,4\}$ and $\sigma(\{3,4\}) = \{1,2\}$, then we
could write the problem in terms of Young tabloids as:

$$P\left(\sigma : \sigma\left(\begin{array}{cc} 1 & 2 \\ 3 & 4 \end{array}\right) = \begin{array}{cc} 3 & 4 \\ 1 & 2 \end{array}\right).$$


In general, we will be able to use the Fourier coefficients at irreducible
representations to compute the marginal probabilities of Young tabloids. As we
shall see, with the help of the James Submodule theorem (James & Kerber, 1981),
the marginals corresponding to "simple" partitions will require very few Fourier
coefficients to compute, which is one of the main strengths of working in the
Fourier domain.


Example 10. Imagine three separate rooms containing two tracks each, in
which Alice and Bob are in room 1 occupying Tracks 1 and 2; Cathy and David
are in room 2 occupying Tracks 3 and 4; and Eric and Frank are in room 3
occupying Tracks 5 and 6, but we are not able to distinguish which person is at
which track in any of the rooms. Then

$$P\left(\sigma : \sigma\left(\begin{array}{cc} A & B \\ C & D \\ E & F \end{array}\right) = \begin{array}{cc} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{array}\right) = 1.$$


It is, in fact, possible to recast the first-order marginals which were described
in the previous section in the language of Young tabloids by noticing that,
for example, if 1 maps to 1, then the unordered set {2,...,n} must map to
{2,...,n} since permutations are one-to-one mappings. The marginal probability
that $\sigma(1) = 1$, then, is equal to the marginal probability that
$\sigma(1) = 1$ and $\sigma(\{2,\ldots,n\}) = \{2,\ldots,n\}$. If n = 6, then the
marginal probability written using Young tabloids is:

$$P\left(\sigma : \sigma\left(\begin{array}{ccccc} 2 & 3 & 4 & 5 & 6 \\ 1 & & & & \end{array}\right) = \begin{array}{ccccc} 2 & 3 & 4 & 5 & 6 \\ 1 & & & & \end{array}\right).$$

The first-order marginal probabilities correspond, therefore, to the marginal
probabilities of Young tabloids of shape $\lambda = (n-1, 1)$.

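Likewise, the second-order unordered marginals correspond to Young tabloids
of shape $\lambda = (n-2, 2)$. If n = 6 again, then the marginal probability that
{1,2} maps to {2,4} corresponds to the following marginal probability for
tabloids:

$$P\left(\sigma : \sigma\left(\begin{array}{cccc} 3 & 4 & 5 & 6 \\ 1 & 2 & & \end{array}\right) = \begin{array}{cccc} 1 & 3 & 5 & 6 \\ 2 & 4 & & \end{array}\right).$$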


The second-order ordered marginals are captured at the partition
$\lambda = (n-2,1,1)$. For example, the marginal probability that {1} maps to {2}
and {2} maps to {4} is given by:

$$P\left(\sigma : \sigma\left(\begin{array}{cccc} 3 & 4 & 5 & 6 \\ 1 & & & \\ 2 & & & \end{array}\right) = \begin{array}{cccc} 1 & 3 & 5 & 6 \\ 2 & & & \\ 4 & & & \end{array}\right).$$

And finally, we remark that the (1,...,1) partition of n recovers all original
probabilities since it asks for a joint distribution over
$\sigma(1), \ldots, \sigma(n)$. The corresponding matrix of marginals has
n! × n! entries.

To see how the marginal probabilities of Young tabloids of shape $\lambda$ can be
thought of as Fourier coefficients, we will define a representation (which we
call the permutation representation) associated with $\lambda$ and show that the
Fourier transform of a distribution at a permutation representation gives
marginal probabilities. We begin by fixing an ordering on the set of possible
Young tabloids, $\{t_1\}, \{t_2\}, \ldots$, and define the permutation
representation $\tau_\lambda(\sigma)$ to be the matrix:

$$[\tau_\lambda(\sigma)]_{ij} = \begin{cases} 1 & \text{if } \sigma(\{t_j\}) = \{t_i\} \\ 0 & \text{otherwise} \end{cases}. \qquad (5.2)$$


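It can be checked that the function $\tau_\lambda$ is indeed a valid
representation of the Symmetric group, and therefore we can compute Fourier
coefficients at $\tau_\lambda$. If $P(\sigma)$ is a probability distribution, then

$$[\hat{P}_{\tau_\lambda}]_{ij} = \sum_{\sigma \in S_n} P(\sigma)\,[\tau_\lambda(\sigma)]_{ij} = \sum_{\{\sigma : \sigma(\{t_j\}) = \{t_i\}\}} P(\sigma) = P(\sigma : \sigma(\{t_j\}) = \{t_i\}),$$

and therefore, the matrix of marginals corresponding to Young tabloids of shape
$\lambda$ is given exactly by the Fourier transform at the representation
$\tau_\lambda$.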
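To make the correspondence between tabloid marginals and the Fourier transform
at a permutation representation concrete, the sketch below enumerates the
tabloids of a given shape and accumulates the matrix of marginals directly
from the definition. The uniform distribution on $S_4$ is used purely as a
placeholder input, the tabloid ordering is whatever the enumeration produces,
and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, permutations

def tabloids(shape, items=None):
    """Yield Young tabloids of the given shape as tuples of frozensets (one per row)."""
    if items is None:
        items = frozenset(range(1, sum(shape) + 1))
    if not shape:
        yield ()
        return
    for row in combinations(sorted(items), shape[0]):
        for rest in tabloids(shape[1:], items - frozenset(row)):
            yield (frozenset(row),) + rest

def tabloid_marginals(dist, shape):
    """Fourier transform at tau_shape: M[i][j] = P(sigma({t_j}) = {t_i})."""
    tabs = list(tabloids(shape))
    index = {t: i for i, t in enumerate(tabs)}
    M = [[Fraction(0)] * len(tabs) for _ in tabs]
    for sigma, prob in dist.items():
        for j, t in enumerate(tabs):
            # Apply sigma to every entry of tabloid t_j, row by row.
            image = tuple(frozenset(sigma[k - 1] for k in row) for row in t)
            M[index[image]][j] += prob
    return tabs, M

S4 = list(permutations(range(1, 5)))
uniform = {sigma: Fraction(1, len(S4)) for sigma in S4}
tabs, M = tabloid_marginals(uniform, (2, 2))

# A point mass at one permutation returns the matrix tau_lambda(sigma) itself.
_, T = tabloid_marginals({(2, 1, 4, 3): Fraction(1)}, (2, 2))
```

For shape (2,2) the enumeration yields the six tabloids mentioned earlier, and
under the uniform distribution every entry of `M` equals 1/6.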

As we showed earlier, the simplest marginals (the zeroth order normalization
constant) correspond to the Fourier transform at $\tau_{(n)}$, while the
first-order marginals correspond to $\tau_{(n-1,1)}$, and the second-order
unordered marginals correspond to $\tau_{(n-2,2)}$. The list goes on and on, with
the marginals getting more complicated. At the other end of the spectrum, we
have the Fourier coefficients at the representation $\tau_{(1,1,\ldots,1)}$, which
exactly recover the original probabilities $P(\sigma)$.

We use the word 'spectrum' suggestively here, because the different levels of
complexity for the marginals are highly reminiscent of the different frequencies
for real-valued signals, and a natural question to ask is how the partitions
might be ordered with respect to the 'complexity' of the corresponding basis
functions. In particular, how might one characterize this vague notion of
complexity for a given partition?

The 'correct' characterization, as it turns out, is to use the dominance
ordering of partitions, which, unlike the ordering on frequencies, is not a
linear order, but rather, a partial order.

Definition 11 (Dominance Ordering). Let $\lambda, \mu$ be partitions of n. Then
$\lambda \trianglerighteq \mu$ (we say $\lambda$ dominates $\mu$), if for each i,
$\sum_{k=1}^{i} \lambda_k \geq \sum_{k=1}^{i} \mu_k$.

Figure 3: (a) Dominance ordering for n = 6. (b) Fourier coefficient matrices
for $S_6$. The dominance order for partitions of n = 6 is shown in the left
diagram (a). Fat Ferrers diagrams tend to be higher in the order and long,
skinny diagrams tend to be lower. The corresponding Fourier coefficient
matrices for each partition (at irreducible representations) are shown in the
right diagram (b). Note that since the Fourier basis functions form a complete
basis for the space of functions on the Symmetric group, there must be exactly
n! coefficients in total.

For example, $(4,2) \trianglerighteq (3,2,1)$ since $4 \geq 3$,
$4 + 2 \geq 3 + 2$, and $4 + 2 + 0 \geq 3 + 2 + 1$. However, (3,3) and (4,1,1)
cannot be compared with respect to the dominance ordering since 3 < 4, but
3 + 3 > 4 + 1. The ordering over the partitions of n = 6 is depicted in
Figure 3(a).
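Definition 11 translates directly into a comparison of partial sums. The
following sketch (with hypothetical function names) pads the shorter partition
with zeros, as in the worked example, and reports when two partitions are
incomparable.

```python
from itertools import accumulate

def dominates(lam, mu):
    """Return True if lam dominates mu; both must be partitions of the same n."""
    if sum(lam) != sum(mu):
        raise ValueError("partitions must have the same size")
    length = max(len(lam), len(mu))
    lam_sums = accumulate(tuple(lam) + (0,) * (length - len(lam)))
    mu_sums = accumulate(tuple(mu) + (0,) * (length - len(mu)))
    return all(a >= b for a, b in zip(lam_sums, mu_sums))

def comparable(lam, mu):
    """The dominance ordering is partial, so some pairs are not comparable."""
    return dominates(lam, mu) or dominates(mu, lam)

print(dominates((4, 2), (3, 2, 1)))
print(comparable((3, 3), (4, 1, 1)))
```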

Partitions with fat Ferrers diagrams tend to be greater (with respect to
dominance ordering) than those with skinny Ferrers diagrams. Intuitively,
representations corresponding to partitions which are high in the dominance
ordering are 'low frequency', while representations corresponding to partitions
which are low in the dominance ordering are 'high frequency'.$^{4}$

Having defined a family of intuitive permutation representations over the
Symmetric group, we can now ask whether the permutation representations are
irreducible or not: the answer, in general, is negative, due to the fact that
it is often possible to reconstruct lower order marginals by summing over the
appropriate higher order marginal probabilities. However, it is possible to show
that, for each permutation representation $\tau_\lambda$, there exists a
corresponding irreducible representation $\rho_\lambda$, which, loosely, captures
all of the information at the 'frequency' $\lambda$ which was not already captured
at lower frequency irreducibles. Moreover, it can be shown that there exists no
irreducible representation besides those indexed by the partitions of n. These
remarkable results are formalized in the James Submodule Theorem, which we
state here without proof (see (Diaconis, 1988; James & Kerber, 1981; Sagan,
2001)).

4 The direction of the ordering is slightly counterintuitive given the frequency
interpretation, but is standard in the literature.

Theorem 12 (James' Submodule Theorem).

1. (Uniqueness) For each partition, $\lambda$, of n, there exists an irreducible
representation, $\rho_\lambda$, which is unique up to equivalence.

2. (Completeness) Every irreducible representation of $S_n$ corresponds to some
partition of n.

3. There exists a matrix $C_\lambda$ associated with each partition $\lambda$, for
which

$$C_\lambda^{-1} \cdot \tau_\lambda(\sigma) \cdot C_\lambda = \bigoplus_{\mu \trianglerighteq \lambda} \bigoplus_{\ell=1}^{K_{\lambda\mu}} \rho_\mu(\sigma), \quad \text{for all } \sigma \in S_n. \qquad (5.3)$$

4. $K_{\lambda\lambda} = 1$ for all partitions $\lambda$.

In plain English, part (3) of the James Submodule theorem says that we
can always reconstruct marginal probabilities of $\lambda$-tabloids using the
Fourier coefficients at irreducibles which lie at $\lambda$ and above in the
dominance ordering, if we have knowledge of the matrix $C_\lambda$ (which can be
precomputed using methods detailed in Appendix B), and the multiplicities
$K_{\lambda\mu}$. In particular, combining Equation 5.3 with the definition of
the Fourier transform, we have that

$$\hat{f}_{\tau_\lambda} = C_\lambda \cdot \left[ \bigoplus_{\mu \trianglerighteq \lambda} \bigoplus_{\ell=1}^{K_{\lambda\mu}} \hat{f}_{\rho_\mu} \right] \cdot C_\lambda^{-1}, \qquad (5.4)$$

and so to obtain marginal probabilities of $\lambda$-tabloids, we simply construct
a block diagonal matrix using the appropriate irreducible Fourier coefficients,
and conjugate by $C_\lambda$. The multiplicities $K_{\lambda\mu}$ are known as the
Kostka numbers and can be computed using Young's rule (Sagan, 2001). To
illustrate using a few examples, we have the following decompositions:


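$$\begin{aligned}
\tau_{(n)} &\equiv \rho_{(n)}, \\
\tau_{(n-1,1)} &\equiv \rho_{(n)} \oplus \rho_{(n-1,1)}, \\
\tau_{(n-2,2)} &\equiv \rho_{(n)} \oplus \rho_{(n-1,1)} \oplus \rho_{(n-2,2)}, \\
\tau_{(n-2,1,1)} &\equiv \rho_{(n)} \oplus \rho_{(n-1,1)} \oplus \rho_{(n-1,1)} \oplus \rho_{(n-2,2)} \oplus \rho_{(n-2,1,1)}, \\
\tau_{(n-3,3)} &\equiv \rho_{(n)} \oplus \rho_{(n-1,1)} \oplus \rho_{(n-2,2)} \oplus \rho_{(n-3,3)}, \\
\tau_{(n-3,2,1)} &\equiv \rho_{(n)} \oplus \rho_{(n-1,1)} \oplus \rho_{(n-1,1)} \oplus \rho_{(n-2,2)} \oplus \rho_{(n-2,2)} \\
&\quad \oplus \rho_{(n-2,1,1)} \oplus \rho_{(n-3,3)} \oplus \rho_{(n-3,2,1)}.
\end{aligned}$$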
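The multiplicities in these decompositions can be checked numerically. The
sketch below counts semistandard Young tableaux by repeatedly stripping off the
boxes holding the largest entry (a horizontal strip), which is one standard way
of evaluating Kostka numbers; it is not necessarily the procedure used by the
authors, and all function names are hypothetical. Here `kostka(mu, lam)` plays
the role of $K_{\lambda\mu}$, the multiplicity of $\rho_\mu$ inside
$\tau_\lambda$.

```python
from functools import lru_cache

def partitions(n, max_part=None):
    """Yield the partitions of n as weakly decreasing tuples."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def remove_horizontal_strips(mu, k):
    """Yield partitions obtained from mu by removing k boxes, at most one per column."""
    def rec(i, remaining, prefix):
        if i == len(mu):
            if remaining == 0:
                yield tuple(part for part in prefix if part > 0)
            return
        below = mu[i + 1] if i + 1 < len(mu) else 0
        for r in range(min(remaining, mu[i] - below) + 1):
            yield from rec(i + 1, remaining - r, prefix + (mu[i] - r,))
    yield from rec(0, k, ())

@lru_cache(maxsize=None)
def kostka(shape, content):
    """Count semistandard tableaux of `shape` containing content[i] copies of i + 1."""
    if not content:
        return 1 if not shape else 0
    return sum(kostka(nu, content[:-1])
               for nu in remove_horizontal_strips(shape, content[-1]))

def decomposition(lam):
    """Multiplicity of each irreducible rho_mu inside tau_lam (zeros omitted)."""
    n = sum(lam)
    return {mu: kostka(mu, lam) for mu in partitions(n) if kostka(mu, lam) > 0}

for lam in [(4, 2), (4, 1, 1), (3, 3), (3, 2, 1)]:
    print(lam, decomposition(lam))
```

With n = 6 the printed multiplicities agree with the decompositions listed
above, and `kostka(lam, lam)` equals 1 for every partition, as stated in part
(4) of Theorem 12.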




Intuitively, the irreducibles at a partition $\lambda$ reflect the "pure"
$\lambda$th-order effects of the underlying distribution. In other words, the
irreducibles at $\lambda$ form a basis for functions that have "interesting"
$\lambda$th-order marginal probabilities, but uniform marginals at all partitions
$\mu$ such that $\mu \triangleright \lambda$.

Example 13. As an example, we demonstrate a "preference" function which is
"purely" second-order (unordered) in the sense that its Fourier coefficients are
equal to zero at all irreducible representations except $\rho_{(n-2,2)}$ (and the
trivial representation). Consider the function $f : S_n \to \mathbb{R}$ defined by:

$$f(\sigma) = \begin{cases} 1 & \text{if } |\sigma(1) - \sigma(2)| \equiv 1 \pmod{n} \\ 0 & \text{otherwise} \end{cases}.$$

Intuitively,  imagine  seating  n  people  at  a  round  table  with  n  chairs,  but  with 
the  constraint  that  the  first  two  people,  Alice  and  Bob,  are  only  happy  if  they 
are  allowed  to  sit  next  to  each  other.  In  this  case,  f  can  be  thought  of  as  the 
indicator  function  for  the  subset  of  seating  arrangements  (permutations)  which 
make  Alice  and  Bob  happy. 

Since f depends only on the destination of the unordered pair {1,2}, its
Fourier transform is zero at all partitions $\mu$ such that
$\mu \triangleleft (n-2,2)$ ($\hat{f}_\mu = 0$). On the other hand, Alice and Bob
have no individual preferences for seating, so the first-order "marginals" of f
are uniform, and hence, $\hat{f}_{(n-1,1)} = 0$. The Fourier coefficients at
irreducibles can be obtained from the second-order (unordered) "marginals"
using Equation 5.3.
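The claim that the first-order "marginals" of the seating function are uniform
while its second-order unordered "marginals" are not can be verified by brute
force for a small table. The sketch below uses n = 5 and reads the condition
$|\sigma(1) - \sigma(2)| \equiv 1 \pmod{n}$ as "the two chairs are adjacent
around the table"; that reading, and the function names, are assumptions made
for illustration.

```python
from itertools import permutations

def happy(sigma, n):
    """1 if persons 1 and 2 sit in adjacent chairs of a round table with n chairs."""
    return 1 if (sigma[0] - sigma[1]) % n in (1, n - 1) else 0

def first_order_sums(f, n):
    """Entry [i][j] sums f over the permutations that send j + 1 to i + 1."""
    M = [[0] * n for _ in range(n)]
    for sigma in permutations(range(1, n + 1)):
        value = f(sigma, n)
        for j in range(n):
            M[sigma[j] - 1][j] += value
    return M

def pair_sums(f, n):
    """Sum f over permutations, grouped by the unordered image of the pair {1, 2}."""
    totals = {}
    for sigma in permutations(range(1, n + 1)):
        key = frozenset(sigma[:2])
        totals[key] = totals.get(key, 0) + f(sigma, n)
    return totals

n = 5
M = first_order_sums(happy, n)
pairs = pair_sums(happy, n)
print(M)
print(sorted((sorted(k), v) for k, v in pairs.items()))
```

Every entry of `M` is the same, whereas `pairs` is nonzero only for adjacent
chairs, which is exactly the second-order structure described in the example.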


The sizes of the irreducible representation matrices are typically much smaller
than their corresponding permutation representation matrices. In the case of
$\lambda = (1,\ldots,1)$ for example, $\dim \tau_\lambda = n!$ while
$\dim \rho_\lambda = 1$. There is a simple combinatorial algorithm, known as the
Hook Formula (Sagan, 2001), for computing the dimension of $\rho_\lambda$. While
we do not discuss it, we provide a few dimensionality computations here
(Table 3) to facilitate a discussion of complexity later. See Figure 3(b) for an
example of what the matrices of a complete Fourier transform on $S_6$ would
look like.
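For readers who want to reproduce the dimensions quoted in Table 3, here is a
minimal sketch of the hook length formula, $\dim \rho_\lambda = n! / \prod h(b)$,
where the product runs over the boxes b of the Ferrers diagram and h(b) counts
the box itself plus the boxes to its right and below it. The function name is
hypothetical.

```python
from math import factorial

def hook_dimension(lam):
    """Dimension of the irreducible indexed by lam: n! over the product of hook lengths."""
    n = sum(lam)
    # Column lengths of the Ferrers diagram (the conjugate partition).
    conj = [sum(1 for part in lam if part > c) for c in range(lam[0])]
    product = 1
    for r, part in enumerate(lam):
        for c in range(part):
            arm = part - c - 1        # boxes to the right in the same row
            leg = conj[c] - r - 1     # boxes below in the same column
            product *= arm + leg + 1
    return factorial(n) // product

n = 6
for lam in [(n,), (n - 1, 1), (n - 2, 2), (n - 2, 1, 1), (n - 3, 3), (n - 3, 2, 1)]:
    print(lam, hook_dimension(lam))
```

Since the irreducibles form a complete basis, the squared dimensions over all
partitions of n sum to n!; for n = 4 this reads 1 + 9 + 4 + 9 + 1 = 24.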

In practice, since the irreducible representation matrices are determined only
up to equivalence, it is necessary to choose a basis for the irreducible
representations in order to explicitly construct the representation matrices. As
in (Kondor et al., 2007), we use the Gel'fand-Tsetlin basis which has several
attractive properties, two advantages being that the matrices are real-valued
and orthogonal. See Appendix A for details on constructing irreducible matrix
representations with respect to the Gel'fand-Tsetlin basis.

    λ          (n)   (n-1,1)   (n-2,2)      (n-2,1,1)       (n-3,3)           (n-3,2,1)
    dim ρ_λ    1     n-1       n(n-3)/2     (n-1)(n-2)/2    n(n-1)(n-5)/6     n(n-2)(n-4)/3

Table 3: Dimensions of low-order irreducible representation matrices.


6  Inference  in  the  Fourier  domain 

What we have shown thus far is that there is a principled method for compactly
summarizing distributions over permutations based on the idea of bandlimiting:
saving only the low-frequency terms of the Fourier transform of a function,
which, as we discussed, is equivalent to maintaining a set of low-order marginal
probabilities. We now turn to the problem of performing probabilistic inference
using our compact summaries. One of the main advantages of viewing marginals
as Fourier coefficients is that it provides a natural principle for formulating
inference, which is to rewrite all inference related operations with respect to
the Fourier domain.

The idea of bandlimiting a distribution is ultimately moot, however, if it
becomes necessary to transform back to the primal domain each time an inference
operation is called. Naively, the Fourier Transform on $S_n$ scales as
$O((n!)^2)$, and even the fastest Fast Fourier Transforms for functions on $S_n$
are no faster than $O(n^2 \cdot n!)$ (see (Maslen, 1998) for example). To resolve
this issue, we present a formulation of inference which operates solely in the
Fourier domain, allowing us to avoid a costly transform. We begin by discussing
exact inference in the Fourier domain, which is no more tractable than the
original problem because there are n! Fourier coefficients, but it will allow us
to introduce the bandlimiting approximation in the next section. There are two
operations to consider: prediction/rollup, and conditioning. The assumption for
the rest of this section is that the Fourier transforms of the transition and
observation models are known. We discuss methods for obtaining the models in
Section 8.

6.1  Fourier  prediction/rollup 

We will consider one particular class of transition models, that of random
walks over a group, which assumes that $\sigma^{(t+1)}$ is generated from
$\sigma^{(t)}$ by drawing a random permutation $\pi^{(t)}$ from some distribution
$Q^{(t)}$ and setting $\sigma^{(t+1)} = \pi^{(t)} \sigma^{(t)}$.$^{5}$ In our
identity management example, $\pi^{(t)}$ represents a random identity permutation
that might occur among tracks when they get close to each other (what we call a
mixing event). For example, Q(1,2) = 1/2 means that Tracks 1 and 2 swapped
identities with probability 1/2. The random walk model also appears in many
other applications such as modeling card shuffles (Diaconis, 1988).

5 We place $\pi$ on the left side of the multiplication because we want it to
permute tracks and not identities. Had we defined $\pi$ to map from tracks to
identities (instead of identities to tracks), then it would be multiplied from
the right. Besides left versus right multiplication, there are no differences
between the two conventions.

The motivation behind the random walk transition model is that it allows
us to write the prediction/rollup operation as a convolution of distributions on
the Symmetric group. The extension of the familiar notion of convolution to
groups simply replaces additions and subtractions by analogous group operations
(function composition and inverse, respectively):


Definition 14. Let Q and P be probability distributions on $S_n$. Define the
convolution$^{6}$ of Q and P to be the function
$[Q * P](\sigma_1) = \sum_{\sigma_2} Q(\sigma_1 \cdot \sigma_2^{-1}) P(\sigma_2)$.

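Using Definition 14, we see that the prediction/rollup step can be written
as:

$$\begin{aligned}
P(\sigma^{(t+1)}) &= \sum_{\sigma^{(t)}} P(\sigma^{(t+1)} \mid \sigma^{(t)}) \cdot P(\sigma^{(t)}), \\
&= \sum_{\{(\sigma^{(t)}, \pi^{(t)}) \,:\, \sigma^{(t+1)} = \pi^{(t)} \cdot \sigma^{(t)}\}} Q^{(t)}(\pi^{(t)}) \cdot P(\sigma^{(t)}), \\
&= \sum_{\sigma^{(t)}} Q^{(t)}(\sigma^{(t+1)} \cdot (\sigma^{(t)})^{-1}) \cdot P(\sigma^{(t)}), \\
&= \left[ Q^{(t)} * P^{(t)} \right](\sigma^{(t+1)}).
\end{aligned}$$

(The third line follows by right-multiplying both sides of
$\sigma^{(t+1)} = \pi^{(t)} \sigma^{(t)}$ by $(\sigma^{(t)})^{-1}$, from which we see
that $\pi^{(t)}$ can be replaced by $\sigma^{(t+1)} (\sigma^{(t)})^{-1}$.)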


As with Fourier transforms on the real line, the Fourier coefficients of the
convolution of distributions P and Q on groups can be obtained from the Fourier
coefficients of P and Q individually, using the convolution theorem (see also
(Diaconis, 1988)):

Proposition 15 (Convolution Theorem). Let Q and P be probability distributions
on $S_n$. For any representation $\rho$,

$$\left[ \widehat{Q * P} \right]_\rho = \hat{Q}_\rho \cdot \hat{P}_\rho,$$

where the operation on the right side is matrix multiplication.

Therefore, assuming that the Fourier transforms $\hat{P}^{(t)}_\rho$ and
$\hat{Q}^{(t)}_\rho$ are given, the prediction/rollup update rule is simply:

$$\hat{P}^{(t+1)}_\rho \leftarrow \hat{Q}^{(t)}_\rho \cdot \hat{P}^{(t)}_\rho.$$

6 Note that this definition of convolution on groups is strictly a generalization
of convolution of functions on the real line, and is a non-commutative operation
for non-abelian groups. Thus the distribution P * Q is not necessarily the same
as Q * P.

    σ          P^(0)   Q^(1)   P^(1)   Q^(2)   P^(2)
    e          1       3/4     3/4     3/4     9/16
    (1,2)      0       1/4     1/4     0       3/16
    (2,3)      0       0       0       0       0
    (1,3)      0       0       0       1/4     3/16
    (1,2,3)    0       0       0       0       1/16
    (1,3,2)    0       0       0       0       0

Table 4: Primal domain prediction/rollup example.


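$$\begin{array}{c|ccccc}
 & \hat{P}^{(0)} & \hat{Q}^{(1)} & \hat{P}^{(1)} & \hat{Q}^{(2)} & \hat{P}^{(2)} \\ \hline
\rho_{(3)} & 1 & 1 & 1 & 1 & 1 \\
\rho_{(2,1)} &
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} &
\begin{bmatrix} 1/2 & 0 \\ 0 & 1 \end{bmatrix} &
\begin{bmatrix} 1/2 & 0 \\ 0 & 1 \end{bmatrix} &
\begin{bmatrix} 7/8 & -\sqrt{3}/8 \\ -\sqrt{3}/8 & 5/8 \end{bmatrix} &
\begin{bmatrix} 7/16 & -\sqrt{3}/8 \\ -\sqrt{3}/16 & 5/8 \end{bmatrix} \\
\rho_{(1,1,1)} & 1 & 1/2 & 1/2 & 1/2 & 1/4
\end{array}$$

Table 5: Fourier domain prediction/rollup example.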


Note that the update only requires knowledge of $\hat{P}$ and does not require
P. Furthermore, the update is pointwise in the Fourier domain in the sense that
the coefficients at the representation $\rho$ affect $\hat{P}^{(t+1)}_\rho$ only at
$\rho$. Consequently, prediction/rollup updates in the Fourier domain never
increase the representational complexity. For example, if we maintain
third-order marginals, then a single step of prediction/rollup called at time t
returns the exact third-order marginals at time t + 1, and nothing more.
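Because the convolution theorem holds for any representation, it can be checked
with the first-order permutation representation, where the Fourier coefficients
are simply matrices of first-order marginals. The sketch below compares the
marginals of a convolved distribution with the product of the two marginal
matrices. The prior and mixing model are hypothetical inputs chosen for
illustration, and the helper names are not taken from the paper.

```python
from fractions import Fraction as F

def compose(p, q):
    """Return 'p after q' in one-line notation: result(j) = p(q(j))."""
    return tuple(p[q[j] - 1] for j in range(len(q)))

def convolve(Q, P):
    """[Q * P](s) sums Q(pi) * P(sigma) over all pairs with s = pi after sigma."""
    out = {}
    for pi, q in Q.items():
        for sigma, p in P.items():
            nxt = compose(pi, sigma)
            out[nxt] = out.get(nxt, F(0)) + q * p
    return out

def first_order(dist, n):
    """Fourier transform at tau_(n-1,1): M[i][j] = P(sigma(j + 1) = i + 1)."""
    M = [[F(0)] * n for _ in range(n)]
    for sigma, prob in dist.items():
        for j in range(n):
            M[sigma[j] - 1][j] += prob
    return M

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

e, s12, s13 = (1, 2, 3), (2, 1, 3), (3, 2, 1)
P_t = {e: F(3, 4), s12: F(1, 4)}     # hypothetical prior at time t
Q_t = {e: F(3, 4), s13: F(1, 4)}     # hypothetical mixing model at time t
P_next = convolve(Q_t, P_t)

lhs = first_order(P_next, 3)                               # transform of Q * P
rhs = matmul(first_order(Q_t, 3), first_order(P_t, 3))     # product of transforms
print(lhs == rhs)
```

Swapping the order of the two factors in the matrix product generally gives a
different result, reflecting the non-commutativity of convolution on $S_n$.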

Example 16. We run the prediction/rollup routines on the first two time
steps of the example in Figure 2, first in the primal domain, then in the Fourier
domain. At each mixing event, two tracks, i and j, swap identities with some
probability. Using a mixing model given by:

$$Q^{(t)}(\pi) = \begin{cases} 3/4 & \text{if } \pi = e \\ 1/4 & \text{if } \pi = (i,j) \\ 0 & \text{otherwise} \end{cases},$$

we obtain results shown in Tables 4 and 5.
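The primal-domain columns of Table 4 can be reproduced by applying the mixing
model twice, first to Tracks 1 and 2 and then to Tracks 1 and 3 (the track pairs
are read off from the supports of the Q columns in the table). This is a sketch
with hypothetical helper names; permutations are in one-line notation.

```python
from fractions import Fraction as F

def compose(p, q):
    """Return 'p after q' in one-line notation: result(j) = p(q(j))."""
    return tuple(p[q[j] - 1] for j in range(len(q)))

def mixing_model(n, i, j, swap_prob=F(1, 4)):
    """Q(e) = 1 - swap_prob, Q((i,j)) = swap_prob, and zero elsewhere."""
    identity = tuple(range(1, n + 1))
    swap = list(identity)
    swap[i - 1], swap[j - 1] = j, i
    return {identity: 1 - swap_prob, tuple(swap): swap_prob}

def rollup(Q, P):
    """Primal-domain prediction/rollup: accumulate Q(pi) * P(sigma) at pi after sigma."""
    out = {}
    for pi, q in Q.items():
        for sigma, p in P.items():
            nxt = compose(pi, sigma)
            out[nxt] = out.get(nxt, F(0)) + q * p
    return out

P0 = {(1, 2, 3): F(1)}                     # all mass on the identity
P1 = rollup(mixing_model(3, 1, 2), P0)     # mixing event between Tracks 1 and 2
P2 = rollup(mixing_model(3, 1, 3), P1)     # mixing event between Tracks 1 and 3
for sigma, prob in sorted(P2.items()):
    print(sigma, prob)
```

In one-line notation the tuples (2,1,3), (3,2,1) and (2,3,1) correspond to the
cycles (1,2), (1,3) and (1,2,3) of the table; permutations that do not appear in
`P2` have probability zero.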

6.1.1  Limitations  of  random  walk  models 

While the random walk assumption captures a rather general family of transition
models, there do exist certain models which cannot be written as a random walk
on a group. In particular, one limitation is that the prediction/rollup update
for a random walk model can never decrease the entropy of the distribution. As
with Kalman filters, localization is thus impossible without making
observations.$^{7}$ (Shin et al., 2005) show that the entropy must increase for a
certain kind of random walk on $S_n$ (where $\pi$ could be either the identity or
the transposition (i,j)), but in fact, the result is easily generalized for any
random walk mixing model and for any group.

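Proposition 17.

$$H\left[ P^{(t+1)}(\sigma^{(t+1)}) \right] \geq \max\left\{ H\left[ Q^{(t)}(\tau^{(t)}) \right], H\left[ P^{(t)}(\sigma^{(t)}) \right] \right\},$$

where $H[P(\sigma)]$ denotes the statistical entropy functional,
$H[P(\sigma)] = -\sum_{\sigma \in G} P(\sigma) \log P(\sigma)$.

Proof. We have:

$$P^{(t+1)}(\sigma^{(t+1)}) = \left[ Q^{(t)} * P^{(t)} \right](\sigma^{(t+1)}) = \sum_{\sigma^{(t)}} Q^{(t)}(\sigma^{(t+1)} \cdot (\sigma^{(t)})^{-1})\, P^{(t)}(\sigma^{(t)}).$$

Applying Jensen's inequality to the entropy function (which is concave) yields:

$$\begin{aligned}
H\left[ P^{(t+1)}(\sigma^{(t+1)}) \right] &\geq \sum_{\sigma^{(t)}} P^{(t)}(\sigma^{(t)})\, H\left[ Q^{(t)}(\sigma \cdot (\sigma^{(t)})^{-1}) \right], && \text{(Jensen's inequality)} \\
&= \sum_{\sigma^{(t)}} P^{(t)}(\sigma^{(t)})\, H\left[ Q^{(t)}(\sigma) \right], && \text{(translation invariance of entropy)} \\
&= H\left[ Q^{(t)}(\sigma) \right], && \text{(since } \textstyle\sum_{\sigma^{(t)}} P^{(t)}(\sigma^{(t)}) = 1\text{)}.
\end{aligned}$$

The proof that $H[P^{(t+1)}(\sigma^{(t+1)})] \geq H[P^{(t)}(\sigma^{(t)})]$ is similar,
with the exception that we must rewrite the convolution so that the sum ranges
over $\tau^{(t)}$:

$$P^{(t+1)}(\sigma^{(t+1)}) = \left[ Q^{(t)} * P^{(t)} \right](\sigma^{(t+1)}) = \sum_{\tau^{(t)}} Q^{(t)}(\tau^{(t)})\, P^{(t)}((\tau^{(t)})^{-1} \cdot \sigma^{(t+1)}).$$

□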

7 In general, if we are not constrained to using linear Gaussian models, it is possible to localize with no observations. Consider a robot walking along the unit interval on the real line (which is not a group). If the position of the robot is unknown, one easy localization strategy might be to simply drive the robot to the right, with the knowledge that given ample time, the robot will slam into the 'wall', at which point it will have been localized. With random walk based models on groups, however, these strategies are impossible (imagine the same robot walking around the unit circle) since, in some sense, the group structure prevents the existence of 'walls'.

Figure 4: We start with a deck of cards in sorted order, and perform fifteen consecutive shuffles according to the rule given in Equation 6.1. The plot shows the entropy of the distribution over permutations with respect to the number of shuffles for n = 3, 4, ..., 8. When $H(P)/\log(n!) = 1$, the distribution has become uniform.

Example 18. This example is based on one from (Diaconis, 1988). Consider a deck of cards numbered {1, ..., n}. Choose a random permutation of cards by first picking two cards independently, and swapping them (a card might be swapped with itself), yielding the following probability distribution over $S_n$:

\[ Q(\pi) = \begin{cases} \frac{1}{n} & \text{if } \pi = \mathrm{id} \\ \frac{2}{n^2} & \text{if } \pi \text{ is a transposition} \\ 0 & \text{otherwise} \end{cases} \tag{6.1} \]

Repeating the above process for generating random permutations $\pi$ gives a transition model for a hidden Markov model over the symmetric group. We can also see (Figure 4) that the entropy of the deck increases monotonically with each shuffle, and that repeated shuffles with $Q(\pi)$ eventually bring the deck to the uniform distribution.
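The behaviour described in Example 18 can be reproduced by brute force for small decks. The following is a minimal sketch, not the authors' implementation: it enumerates $S_n$ explicitly, builds the random-transposition model of Equation 6.1, and repeatedly convolves it with the current distribution. The function names (`random_transposition_model`, `convolve`, `shuffle_entropies`) and the tuple encoding of permutations are assumptions made for illustration.

```python
import math


def compose(a, b):
    """Return the permutation a∘b, i.e. i -> a[b[i]] (permutations are tuples)."""
    return tuple(a[b[i]] for i in range(len(b)))


def random_transposition_model(n):
    """Equation 6.1: pick two positions independently and swap them."""
    q = {}
    for i in range(n):
        for j in range(n):
            p = list(range(n))
            p[i], p[j] = p[j], p[i]
            q[tuple(p)] = q.get(tuple(p), 0.0) + 1.0 / n ** 2
    return q


def convolve(q, p):
    """[Q * P](sigma) = sum_tau Q(tau) P(tau^{-1} sigma), computed by enumeration."""
    out = {}
    for tau, qv in q.items():
        for sigma, pv in p.items():
            key = compose(tau, sigma)
            out[key] = out.get(key, 0.0) + qv * pv
    return out


def entropy(p):
    """Shannon entropy (natural log) of a distribution stored as a dict."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def shuffle_entropies(n, steps):
    """Entropy of the deck after 0, 1, ..., steps shuffles, starting sorted."""
    q = random_transposition_model(n)
    p = {tuple(range(n)): 1.0}
    history = [entropy(p)]
    for _ in range(steps):
        p = convolve(q, p)
        history.append(entropy(p))
    return history
```

Dividing each entry of `shuffle_entropies(n, 15)` by `math.log(math.factorial(n))` gives the normalized quantity plotted in Figure 4. The enumeration costs on the order of $n!$ operations per step, so this sketch is only practical for the small values of n shown in the figure.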

6.2  Fourier  conditioning 

In  contrast  with  the  prediction/rollup  operation,  conditioning  can  potentially 
increase  the  representational  complexity.  As  an  example,  suppose  that  we  know 
the  following  first-order  marginal  probabilities: 

P(Alice  is  at  Track  1  or  Track  2)  =  .9,  and 

P(Bob  is  at  Track  1  or  Track  2)  =  .9. 

If  we  then  make  the  following  first-order  observation: 

P(Cathy is at Track 1 or Track 2) = 1,

then it can be inferred that Alice and Bob cannot both occupy Tracks 1 and 2 at the same time, i.e.,

P({Alice,Bob} occupy Tracks {1,2}) = 0,

demonstrating that after conditioning, we are left with knowledge of second-order (unordered) marginals despite the fact that the prior and likelihood functions were only known up to first-order. Intuitively, the example shows that conditioning "smears" information from low-order Fourier coefficients to high-order coefficients, and that one cannot hope for a pointwise operation as was afforded by prediction/rollup. We now show precisely how irreducibles of different complexities "interact" with each other in the Fourier domain during conditioning.

An application of Bayes rule to find a posterior distribution $P(\sigma|z)$ after observing some evidence $z$ requires two steps: a pointwise product of likelihood $P(z|\sigma)$ and prior $P(\sigma)$, followed by a normalization step:

\[ P(\sigma|z) = \eta \cdot P(z|\sigma) \cdot P(\sigma). \]

For notational convenience, we will refer to the likelihood function as $L(z|\sigma)$ henceforth. We showed earlier that the normalization constant $\eta^{-1} = \sum_\sigma L(z|\sigma) \cdot P(\sigma)$ is given by the Fourier transform of $L^{(t)} P^{(t)}$ at the trivial representation, and therefore the normalization step of conditioning can be implemented by simply dividing each Fourier coefficient by the scalar $\left[\widehat{L^{(t)} P^{(t)}}\right]_{\rho_{(n)}}$.

The pointwise product of two functions $f$ and $g$, however, is trickier to formulate in the Fourier domain. For functions on the real line, the pointwise product of functions can be implemented by convolving the Fourier coefficients of $f$ and $g$, and so a natural question is: can we apply a similar operation for functions over general groups? Our answer to this is that there is an analogous (but more complicated) notion of convolution in the Fourier domain of a general finite group. We present a convolution-based conditioning algorithm which we call Kronecker Conditioning, which, in contrast to the pointwise nature of the Fourier domain prediction/rollup step, and much like convolution, smears the information at an irreducible $\rho_\nu$ to other irreducibles.

6.2.1  Fourier  transform  of  the  pointwise  product 

Our approach to computing the Fourier transform of the pointwise product in terms of $\hat{f}$ and $\hat{g}$ is to manipulate the function $f(\sigma) g(\sigma)$ so that it can be seen as the result of an inverse Fourier transform (Equation 4.4). Hence, the goal will be to find matrices $R_\nu$ (as a function of $\hat{f}, \hat{g}$) such that for any $\sigma \in G$,

\[ f(\sigma) \cdot g(\sigma) = \frac{1}{|G|} \sum_\nu d_{\rho_\nu} \mathrm{Tr}\left( R_\nu^T \cdot \rho_\nu(\sigma) \right), \tag{6.2} \]

after which we will be able to read off the Fourier transform of the pointwise product as $\left[\widehat{fg}\right]_{\rho_\nu} = R_\nu$.

1. If $A$ and $B$ are square, $\mathrm{Tr}(A \otimes B) = (\mathrm{Tr} A) \cdot (\mathrm{Tr} B)$.

2. $(A \otimes B) \cdot (C \otimes D) = AC \otimes BD$.

3. Let $A$ be an $n \times n$ matrix, and $C$ an invertible $n \times n$ matrix. Then $\mathrm{Tr} A = \mathrm{Tr}(C^{-1} A C)$.

4. Let $A$ be an $n \times n$ matrix and $B_i$ be matrices of size $m_i \times m_i$ where $\sum_i m_i = n$. Then $\mathrm{Tr}\left(A \cdot \left(\bigoplus_i B_i\right)\right) = \sum_i \mathrm{Tr}(A_i \cdot B_i)$, where $A_i$ is the block of $A$ corresponding to block $B_i$ in the matrix $\left(\bigoplus_i B_i\right)$.


Table 6: Matrix Identities used in Proposition 19.
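Properties 1, 2 and 4 of Table 6 are easy to sanity-check numerically. The sketch below is only an illustration under stated assumptions: it uses plain nested lists as matrices and random test matrices from a seeded generator, and every helper name (`kron`, `direct_sum`, `check_identities`) is hypothetical rather than taken from the paper. The Kronecker product is the block matrix whose $(i,j)$ block is $u_{i,j} V$.

```python
import random


def matmul(X, Y):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def kron(X, Y):
    """Kronecker product: the block matrix whose (i, j) block is X[i][j] * Y."""
    return [[x * y for x in rx for y in ry] for rx in X for ry in Y]


def direct_sum(X, Y):
    """Block-diagonal matrix with X in the top-left and Y in the bottom-right."""
    n, m = len(X), len(Y)
    return [row + [0.0] * m for row in X] + [[0.0] * n + row for row in Y]


def trace(X):
    return sum(X[i][i] for i in range(len(X)))


def rand_matrix(n, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]


def close(X, Y, tol=1e-9):
    return all(abs(x - y) < tol for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


def check_identities(seed=0):
    """Return three booleans, one each for Properties 1, 2 and 4 of Table 6."""
    rng = random.Random(seed)
    A, C = rand_matrix(2, rng), rand_matrix(2, rng)
    B, D = rand_matrix(3, rng), rand_matrix(3, rng)
    prop1 = abs(trace(kron(A, B)) - trace(A) * trace(B)) < 1e-9
    prop2 = close(matmul(kron(A, B), kron(C, D)), kron(matmul(A, C), matmul(B, D)))
    # Property 4: only the diagonal blocks of M meet the blocks of a direct sum.
    M = rand_matrix(5, rng)
    M1 = [row[:2] for row in M[:2]]
    M2 = [row[2:] for row in M[2:]]
    lhs = trace(matmul(M, direct_sum(A, B)))
    rhs = trace(matmul(M1, A)) + trace(matmul(M2, B))
    prop4 = abs(lhs - rhs) < 1e-9
    return prop1, prop2, prop4
```

Property 3 is the usual invariance of the trace under a change of basis and is omitted here only because it needs a matrix inverse, which this dependency-free sketch avoids.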


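For any $\sigma \in G$ we can write the pointwise product in terms of $\hat{f}$ and $\hat{g}$ using the inverse Fourier transform:

\begin{align*}
f(\sigma) \cdot g(\sigma) &= \left[ \frac{1}{|G|} \sum_\lambda d_{\rho_\lambda} \mathrm{Tr}\left( \hat{f}_{\rho_\lambda}^T \cdot \rho_\lambda(\sigma) \right) \right] \cdot \left[ \frac{1}{|G|} \sum_\mu d_{\rho_\mu} \mathrm{Tr}\left( \hat{g}_{\rho_\mu}^T \cdot \rho_\mu(\sigma) \right) \right] \\
&= \left( \frac{1}{|G|} \right)^2 \sum_{\lambda, \mu} d_{\rho_\lambda} d_{\rho_\mu} \left[ \mathrm{Tr}\left( \hat{f}_{\rho_\lambda}^T \cdot \rho_\lambda(\sigma) \right) \cdot \mathrm{Tr}\left( \hat{g}_{\rho_\mu}^T \cdot \rho_\mu(\sigma) \right) \right]. \tag{6.3}
\end{align*}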


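Now we want to manipulate this product of traces in the last line to be just one trace (as in Equation 6.2), by appealing to some properties of the Kronecker Product. The Kronecker product of an $n \times n$ matrix $U = (u_{i,j})$ by an $m \times m$ matrix $V$, is defined to be the $nm \times nm$ matrix

\[ U \otimes V = \begin{pmatrix} u_{1,1} V & u_{1,2} V & \cdots & u_{1,n} V \\ u_{2,1} V & u_{2,2} V & \cdots & u_{2,n} V \\ \vdots & \vdots & \ddots & \vdots \\ u_{n,1} V & u_{n,2} V & \cdots & u_{n,n} V \end{pmatrix}. \]
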

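We summarize some important matrix properties in Table 6. The connection to our problem is given by matrix property 1. Applying this to Equation 6.3, we have:

\begin{align*}
\mathrm{Tr}\left( \hat{f}_{\rho_\lambda}^T \cdot \rho_\lambda(\sigma) \right) \cdot \mathrm{Tr}\left( \hat{g}_{\rho_\mu}^T \cdot \rho_\mu(\sigma) \right) &= \mathrm{Tr}\left( \left( \hat{f}_{\rho_\lambda}^T \cdot \rho_\lambda(\sigma) \right) \otimes \left( \hat{g}_{\rho_\mu}^T \cdot \rho_\mu(\sigma) \right) \right) \\
&= \mathrm{Tr}\left( \left( \hat{f}_{\rho_\lambda} \otimes \hat{g}_{\rho_\mu} \right)^T \cdot \left( \rho_\lambda(\sigma) \otimes \rho_\mu(\sigma) \right) \right),
\end{align*}

where the last line follows by Property 2. The term on the left, $\hat{f}_{\rho_\lambda} \otimes \hat{g}_{\rho_\mu}$, is a matrix of coefficients. The term on the right, $\rho_\lambda(\sigma) \otimes \rho_\mu(\sigma)$, itself happens to be a representation, called the Kronecker (or Tensor) Product Representation. In general, the Kronecker product representation is reducible, and so it can be decomposed into a direct sum of irreducibles. In particular, if $\rho_\lambda$ and $\rho_\mu$ are any two irreducibles of $G$, there exists a similarity transform $C_{\lambda\mu}$ such that, for any $\sigma \in G$,

\[ C_{\lambda\mu}^{-1} \cdot \left[ \rho_\lambda \otimes \rho_\mu \right](\sigma) \cdot C_{\lambda\mu} = \bigoplus_\nu \bigoplus_{\ell=1}^{z_{\lambda\mu\nu}} \rho_\nu(\sigma). \tag{6.4} \]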

The $\oplus$ symbols here refer to a matrix direct sum as in Equation 2, $\nu$ indexes over all irreducible representations of $S_n$, while $\ell$ indexes over a number of copies of $\rho_\nu$ which appear in the decomposition. We index blocks on the right side of this equation by pairs of indices $(\nu, \ell)$. The number of copies of each $\rho_\nu$ (for the tensor product pair $\rho_\lambda \otimes \rho_\mu$) is denoted by the integer $z_{\lambda\mu\nu}$, the collection of which, taken over all triples $(\lambda, \mu, \nu)$, are commonly referred to as the Clebsch-Gordan series. Note that we allow the $z_{\lambda\mu\nu}$ to be zero, in which case $\rho_\nu$ does not contribute to the direct sum. The matrices $C_{\lambda\mu}$ are known as the Clebsch-Gordan coefficients. The Kronecker Product Decomposition problem is that of finding the irreducible components of the Kronecker product representation, and thus to find the Clebsch-Gordan series/coefficients for each pair of irreducible representations $(\rho_\lambda, \rho_\mu)$.

Decomposing the Kronecker product inside Equation 6.4 using the Clebsch-Gordan series and coefficients yields the desired Fourier transform, which we summarize in the form of a proposition. In the case that $f$ and $g$ are defined over an abelian group, then the following formulas reduce to the familiar form of convolution.


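Proposition 19. Let $\hat{f}, \hat{g}$ be the Fourier transforms of functions $f$ and $g$ respectively, and for each ordered pair of irreducibles $(\rho_\lambda, \rho_\mu)$, define: $A_{\lambda\mu} \triangleq C_{\lambda\mu}^T \cdot \left( \hat{f}_{\rho_\lambda} \otimes \hat{g}_{\rho_\mu} \right) \cdot C_{\lambda\mu}$. Then the Fourier transform of the pointwise product $fg$ is:

\[ \left[ \widehat{fg} \right]_{\rho_\nu} = \frac{1}{d_{\rho_\nu} |G|} \sum_{\lambda\mu} d_{\rho_\lambda} d_{\rho_\mu} \sum_{\ell=1}^{z_{\lambda\mu\nu}} A_{\lambda\mu}^{(\nu,\ell)}, \tag{6.5} \]

where $A_{\lambda\mu}^{(\nu,\ell)}$ is the block of $A_{\lambda\mu}$ corresponding to the $(\nu, \ell)$ block in $\bigoplus_\nu \bigoplus_\ell \rho_\nu$ from Equation 6.4.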


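Proof. We use the fact that $C_{\lambda\mu}$ is an orthogonal matrix for all pairs $(\rho_\lambda, \rho_\mu)$, i.e., $C_{\lambda\mu}^T \cdot C_{\lambda\mu} = I$.

\begin{align*}
f(\sigma) \cdot g(\sigma) &= \left[ \frac{1}{|G|} \sum_\lambda d_{\rho_\lambda} \mathrm{Tr}\left( \hat{f}_{\rho_\lambda}^T \cdot \rho_\lambda(\sigma) \right) \right] \cdot \left[ \frac{1}{|G|} \sum_\mu d_{\rho_\mu} \mathrm{Tr}\left( \hat{g}_{\rho_\mu}^T \cdot \rho_\mu(\sigma) \right) \right] \\
&= \left( \frac{1}{|G|} \right)^2 \sum_{\lambda,\mu} d_{\rho_\lambda} d_{\rho_\mu} \left[ \mathrm{Tr}\left( \hat{f}_{\rho_\lambda}^T \cdot \rho_\lambda(\sigma) \right) \cdot \mathrm{Tr}\left( \hat{g}_{\rho_\mu}^T \cdot \rho_\mu(\sigma) \right) \right] \\
\text{(by Property 1)} \quad &= \left( \frac{1}{|G|} \right)^2 \sum_{\lambda,\mu} d_{\rho_\lambda} d_{\rho_\mu} \mathrm{Tr}\left( \left( \hat{f}_{\rho_\lambda}^T \cdot \rho_\lambda(\sigma) \right) \otimes \left( \hat{g}_{\rho_\mu}^T \cdot \rho_\mu(\sigma) \right) \right) \\
\text{(by Property 2)} \quad &= \left( \frac{1}{|G|} \right)^2 \sum_{\lambda,\mu} d_{\rho_\lambda} d_{\rho_\mu} \mathrm{Tr}\left( \left( \hat{f}_{\rho_\lambda} \otimes \hat{g}_{\rho_\mu} \right)^T \cdot \left( \rho_\lambda(\sigma) \otimes \rho_\mu(\sigma) \right) \right) \\
\text{(by Property 3)} \quad &= \left( \frac{1}{|G|} \right)^2 \sum_{\lambda,\mu} d_{\rho_\lambda} d_{\rho_\mu} \mathrm{Tr}\left( C_{\lambda\mu}^T \cdot \left( \hat{f}_{\rho_\lambda} \otimes \hat{g}_{\rho_\mu} \right)^T \cdot C_{\lambda\mu} \cdot C_{\lambda\mu}^T \cdot \left( \rho_\lambda(\sigma) \otimes \rho_\mu(\sigma) \right) \cdot C_{\lambda\mu} \right) \\
\text{(by definition of } C_{\lambda\mu} \text{ and } A_{\lambda\mu}\text{)} \quad &= \left( \frac{1}{|G|} \right)^2 \sum_{\lambda,\mu} d_{\rho_\lambda} d_{\rho_\mu} \mathrm{Tr}\left( A_{\lambda\mu}^T \cdot \bigoplus_\nu \bigoplus_{\ell=1}^{z_{\lambda\mu\nu}} \rho_\nu(\sigma) \right) \\
\text{(by Property 4)} \quad &= \frac{1}{|G|^2} \sum_{\lambda\mu} d_{\rho_\lambda} d_{\rho_\mu} \sum_\nu \sum_{\ell=1}^{z_{\lambda\mu\nu}} \mathrm{Tr}\left( \left( A_{\lambda\mu}^{(\nu,\ell)} \right)^T \rho_\nu(\sigma) \right) \\
\text{(rearranging terms)} \quad &= \frac{1}{|G|} \sum_\nu d_{\rho_\nu} \mathrm{Tr}\left( \left( \sum_{\lambda\mu} \sum_{\ell=1}^{z_{\lambda\mu\nu}} \frac{d_{\rho_\lambda} d_{\rho_\mu}}{d_{\rho_\nu} |G|} A_{\lambda\mu}^{(\nu,\ell)} \right)^T \rho_\nu(\sigma) \right).
\end{align*}

Recognizing the last expression as an inverse Fourier transform completes the proof. □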


The Clebsch-Gordan series, $z_{\lambda\mu\nu}$, plays an important role in Equation 6.5, which says that the $(\rho_\lambda, \rho_\mu)$ crossterm contributes to the pointwise product at $\rho_\nu$ only when $z_{\lambda\mu\nu} > 0$. In the simplest case, we have that

\[ z_{(n),\mu,\nu} = \begin{cases} 1 & \text{if } \mu = \nu \\ 0 & \text{otherwise} \end{cases}, \]

which is true since $\rho_{(n)}(\sigma) = 1$ for all $\sigma \in S_n$. As another example, it is known that:

\[ \rho_{(n-1,1)} \otimes \rho_{(n-1,1)} \equiv \rho_{(n)} \oplus \rho_{(n-1,1)} \oplus \rho_{(n-2,2)} \oplus \rho_{(n-2,1,1)}, \tag{6.6} \]

or equivalently,

\[ z_{(n-1,1),(n-1,1),\nu} = \begin{cases} 1 & \text{if } \nu \text{ is one of } (n), (n-1,1), (n-2,2), \text{ or } (n-2,1,1) \\ 0 & \text{otherwise} \end{cases}. \]

So if the Fourier transforms of the likelihood and prior are zero past the first two irreducibles ($(n)$ and $(n-1,1)$), then a single conditioning step results in a Fourier transform which, in general, carries second-order information at $(n-2,2)$ and $(n-2,1,1)$, but is guaranteed to be zero past the first four irreducibles $(n)$, $(n-1,1)$, $(n-2,2)$ and $(n-2,1,1)$.
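Equation 6.6 can be checked numerically with characters instead of explicit representation matrices. The sketch below rests on assumptions that are not part of the text above: it uses the standard character formulas for these four irreducibles in terms of the number of fixed points $f$ and the number of 2-cycles $t$ of a permutation (valid for $n \geq 4$), and it computes each multiplicity as the inner product $\frac{1}{n!} \sum_\sigma \chi_{(n-1,1)}(\sigma)^2 \chi_\nu(\sigma)$. The function names are hypothetical.

```python
import itertools
from fractions import Fraction


def cycle_counts(perm):
    """Return (number of fixed points, number of 2-cycles) of a permutation tuple."""
    seen, fixed, two = set(), 0, 0
    for start in range(len(perm)):
        if start in seen:
            continue
        length, j = 0, start
        while j not in seen:
            seen.add(j)
            j = perm[j]
            length += 1
        if length == 1:
            fixed += 1
        elif length == 2:
            two += 1
    return fixed, two


def characters(perm):
    """Characters of four low-order irreducibles of S_n (assumes n >= 4)."""
    f, t = cycle_counts(perm)
    return {
        "(n)": 1,
        "(n-1,1)": f - 1,
        "(n-2,2)": f * (f - 1) // 2 + t - f,
        "(n-2,1,1)": (f - 1) * (f - 2) // 2 - t,
    }


def cg_series_first_order(n):
    """Multiplicity of each listed irreducible in rho_(n-1,1) tensor rho_(n-1,1)."""
    group = list(itertools.permutations(range(n)))
    z = {name: Fraction(0) for name in characters(group[0])}
    for perm in group:
        chi = characters(perm)
        for name in z:
            z[name] += Fraction(chi["(n-1,1)"] ** 2 * chi[name], len(group))
    return z


def dimensions(n):
    """Dimensions of the four irreducibles, read off as characters of the identity."""
    return characters(tuple(range(n)))
```

For a given `n`, `cg_series_first_order(n)` returns the four multiplicities as exact fractions, and comparing `sum(dimensions(n).values())` with `(n - 1) ** 2` shows whether the four listed irreducibles already account for the full dimension of the tensor product.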

As far as we know, there are no analytical formulas for finding the entire Clebsch-Gordan series or coefficients, and in practice, these computations do in fact take a long time. We emphasize, however, that as fundamental constants related to the irreducibles of the Symmetric group, they need only be computed once (like the digits of $\pi$, for example) and can be stored in a table for all future reference. For a detailed discussion of techniques for computing the Clebsch-Gordan series/coefficients, see Appendix B. We plan to make a set of precomputed coefficients available on the web, but for now we will assume throughout the rest of the paper that both the series and coefficients have been made available as a lookup table. We conclude our section on inference with a fully worked example of Kronecker conditioning.

Example 20. For this example, refer to Table 2 for the representations of $S_3$. Given functions $f, g : S_3 \to \mathbb{R}$, we will compute the Fourier transform of the pointwise product $f \cdot g$.

Since there are three irreducibles, there are nine tensor products $\rho_\lambda \otimes \rho_\mu$ to decompose, six of which are trivial either because they are one-dimensional, or involve tensoring against the trivial representation. The nontrivial tensor products to consider are $\rho_{(2,1)} \otimes \rho_{(1,1,1)}$, $\rho_{(1,1,1)} \otimes \rho_{(2,1)}$ and $\rho_{(2,1)} \otimes \rho_{(2,1)}$. The Clebsch-Gordan series for the nontrivial tensor products are:

                   z_{(2,1),(1,1,1),ν}   z_{(1,1,1),(2,1),ν}   z_{(2,1),(2,1),ν}
    ν = (3)                 0                     0                    1
    ν = (2,1)               1                     1                    1
    ν = (1,1,1)             0                     0                    1

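The Clebsch-Gordan coefficients for the nontrivial tensor products are given by the following orthogonal matrices:

\[ C_{(2,1)\otimes(1,1,1)} = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}, \qquad C_{(1,1,1)\otimes(2,1)} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}, \]

\[ C_{(2,1)\otimes(2,1)} = \frac{\sqrt{2}}{2} \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & -1 & 0 & 1 \\ 0 & -1 & 0 & -1 \\ 1 & 0 & 1 & 0 \end{bmatrix}. \]

As in Proposition 19, define:

\begin{align*}
A_{(2,1)\otimes(1,1,1)} &= C_{(2,1)\otimes(1,1,1)}^T \left( \hat{f}_{(2,1)} \otimes \hat{g}_{(1,1,1)} \right) C_{(2,1)\otimes(1,1,1)}, \tag{6.7} \\
A_{(1,1,1)\otimes(2,1)} &= C_{(1,1,1)\otimes(2,1)}^T \left( \hat{f}_{(1,1,1)} \otimes \hat{g}_{(2,1)} \right) C_{(1,1,1)\otimes(2,1)}, \tag{6.8} \\
A_{(2,1)\otimes(2,1)} &= C_{(2,1)\otimes(2,1)}^T \left( \hat{f}_{(2,1)} \otimes \hat{g}_{(2,1)} \right) C_{(2,1)\otimes(2,1)}. \tag{6.9}
\end{align*}

Then Proposition 19 gives the following formulas:

\begin{align*}
\left[ \widehat{f \cdot g} \right]_{\rho_{(3)}} &= \frac{1}{3!} \left[ \hat{f}_{\rho_{(3)}} \cdot \hat{g}_{\rho_{(3)}} + \hat{f}_{\rho_{(1,1,1)}} \cdot \hat{g}_{\rho_{(1,1,1)}} + 4 \cdot \left[ A_{(2,1)\otimes(2,1)} \right]_{1,1} \right], \tag{6.10} \\
\left[ \widehat{f \cdot g} \right]_{\rho_{(2,1)}} &= \frac{1}{3!} \left[ \hat{f}_{\rho_{(2,1)}} \cdot \hat{g}_{\rho_{(3)}} + \hat{f}_{\rho_{(3)}} \cdot \hat{g}_{\rho_{(2,1)}} + A_{(1,1,1)\otimes(2,1)} + A_{(2,1)\otimes(1,1,1)} + 2 \cdot \left[ A_{(2,1)\otimes(2,1)} \right]_{2:3,2:3} \right], \tag{6.11} \\
\left[ \widehat{f \cdot g} \right]_{\rho_{(1,1,1)}} &= \frac{1}{3!} \left[ \hat{f}_{\rho_{(3)}} \cdot \hat{g}_{\rho_{(1,1,1)}} + \hat{f}_{\rho_{(1,1,1)}} \cdot \hat{g}_{\rho_{(3)}} + 4 \cdot \left[ A_{(2,1)\otimes(2,1)} \right]_{4,4} \right], \tag{6.12}
\end{align*}

where the notation $[A]_{a:b,c:d}$ denotes the block of entries in $A$ between rows $a$ and $b$, and between columns $c$ and $d$ (inclusive).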
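Equations 6.10 to 6.12 can be checked against a direct computation on $S_3$. The sketch below is an illustration under assumptions, not the paper's code: Table 2 is not reproduced in this section, so the two-dimensional irreducible is built from two assumed generator matrices (for the adjacent transpositions) and extended to the whole group as a homomorphism, and the Fourier transform is taken to be $\hat{f}_\rho = \sum_\sigma f(\sigma) \rho(\sigma)$. All function and variable names are hypothetical.

```python
import itertools
import math

S3 = list(itertools.permutations(range(3)))
IDENT, S12, S23 = (0, 1, 2), (1, 0, 2), (0, 2, 1)
R = math.sqrt(3.0) / 2.0


def compose(a, b):
    """Return a∘b, i.e. i -> a[b[i]]."""
    return tuple(a[b[i]] for i in range(3))


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def kron(X, Y):
    return [[x * y for x in rx for y in ry] for rx in X for ry in Y]


def add(*Ms):
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*Ms)]


def scale(c, X):
    return [[c * x for x in row] for row in X]


def build_rep(gens):
    """Extend generator matrices to all of S3 so that rho(a∘b) = rho(a) rho(b)."""
    d = len(gens[S12])
    rep = {IDENT: [[float(i == j) for j in range(d)] for i in range(d)]}
    stack = [IDENT]
    while stack:
        g = stack.pop()
        for s, m in gens.items():
            h = compose(s, g)
            if h not in rep:
                rep[h] = matmul(m, rep[g])
                stack.append(h)
    return rep


TRIV = build_rep({S12: [[1.0]], S23: [[1.0]]})
SIGN = build_rep({S12: [[-1.0]], S23: [[-1.0]]})
# Assumed generator matrices for the two-dimensional irreducible rho_(2,1).
STD = build_rep({S12: [[-1.0, 0.0], [0.0, 1.0]], S23: [[0.5, R], [R, -0.5]]})


def fourier(f, rep):
    """Fourier transform of f at a representation: sum_sigma f(sigma) rho(sigma)."""
    return add(*[scale(f[s], rep[s]) for s in S3])


def conj(C, X):
    """Return C^T X C."""
    return matmul(transpose(C), matmul(X, C))


C21_111 = [[0.0, 1.0], [-1.0, 0.0]]
C111_21 = [[0.0, -1.0], [1.0, 0.0]]
C21_21 = scale(1 / math.sqrt(2),
               [[1, 0, -1, 0], [0, -1, 0, 1], [0, -1, 0, -1], [1, 0, 1, 0]])


def pointwise_product_transform(f, g):
    """Evaluate Equations 6.10, 6.11 and 6.12 from the transforms of f and g."""
    ft, fs, fd = (fourier(f, r) for r in (TRIV, SIGN, STD))
    gt, gs, gd = (fourier(g, r) for r in (TRIV, SIGN, STD))
    a_21_111 = conj(C21_111, kron(fd, gs))   # Equation 6.7
    a_111_21 = conj(C111_21, kron(fs, gd))   # Equation 6.8
    a_21_21 = conj(C21_21, kron(fd, gd))     # Equation 6.9
    block = [row[1:3] for row in a_21_21[1:3]]
    triv = (ft[0][0] * gt[0][0] + fs[0][0] * gs[0][0] + 4 * a_21_21[0][0]) / 6
    std = scale(1 / 6, add(scale(gt[0][0], fd), scale(ft[0][0], gd),
                           a_111_21, a_21_111, scale(2, block)))
    sign = (ft[0][0] * gs[0][0] + fs[0][0] * gt[0][0] + 4 * a_21_21[3][3]) / 6
    return triv, std, sign
```

To use it, store `f` and `g` as dictionaries keyed by the tuples in `S3`, call `pointwise_product_transform(f, g)`, and compare the three results with `fourier(fg, TRIV)`, `fourier(fg, STD)` and `fourier(fg, SIGN)` where `fg[s] = f[s] * g[s]`. Building the representation from generators keeps the check independent of which composition convention is used for permutations.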

Using the above formulas, we can continue on Example 16 and compute the last update step in our identity management problem (Figure 2). At the final time step, we observe that Bob is at track 1 with 100% certainty. Our likelihood function is therefore nonzero only for the permutations which map Bob (the second identity) to the first track:

\[ L(\sigma) \propto \begin{cases} 1 & \text{if } \sigma = (1,2) \text{ or } (1,3,2) \\ 0 & \text{otherwise} \end{cases} \]

Algorithm 1: Pseudocode for the Fourier Prediction/Rollup Algorithm.
PredictionRollup
    input : $\hat{Q}^{(t)}_{\rho_\lambda}$ and $\hat{P}^{(t)}_{\rho_\lambda}$, $\rho_\lambda \in \Lambda$
    output: $\hat{P}^{(t+1)}_{\rho_\lambda}$, $\rho_\lambda \in \Lambda$
    foreach $\rho_\lambda \in \Lambda$ do $\hat{P}^{(t+1)}_{\rho_\lambda} \leftarrow \hat{Q}^{(t)}_{\rho_\lambda} \cdot \hat{P}^{(t)}_{\rho_\lambda}$ ;


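The Fourier transform of the likelihood function is:

\[ \hat{L}_{\rho_{(3)}} = 2, \qquad \hat{L}_{\rho_{(2,1)}} = \begin{bmatrix} -3/2 & \sqrt{3}/2 \\ -\sqrt{3}/2 & 1/2 \end{bmatrix}, \qquad \hat{L}_{\rho_{(1,1,1)}} = 0. \tag{6.13} \]

Plugging the Fourier transforms of the prior distribution ($\hat{P}^{(2)}$ from Table 5) and likelihood (Equation 6.13) into Equations 6.7, 6.8, 6.9, we have:

\[ A_{(2,1)\otimes(1,1,1)} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}, \qquad A_{(1,1,1)\otimes(2,1)} = \frac{1}{8} \begin{bmatrix} 1 & \sqrt{3} \\ -\sqrt{3} & -3 \end{bmatrix}, \]

\[ A_{(2,1)\otimes(2,1)} = \frac{1}{32} \begin{bmatrix} -7 & -\sqrt{3} & 11 & 5\sqrt{3} \\ -2\sqrt{3} & -10 & -6\sqrt{3} & -14 \\ 20 & 12\sqrt{3} & -4 & 4\sqrt{3} \\ -11\sqrt{3} & -23 & -\sqrt{3} & -13 \end{bmatrix}. \]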

To invoke Bayes rule in the Fourier domain, we perform a pointwise product using Equations 6.10, 6.11, 6.12, and normalize by dividing by the trivial coefficient, which yields the Fourier transform of the posterior distribution as:

\[ \left[ \widehat{P(\sigma|z)} \right]_{\rho_{(3)}} = 1, \qquad \left[ \widehat{P(\sigma|z)} \right]_{\rho_{(2,1)}} = \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad \left[ \widehat{P(\sigma|z)} \right]_{\rho_{(1,1,1)}} = -1. \tag{6.14} \]

Finally, we can see that the result is correct by recognizing that the Fourier transform of the posterior (Equation 6.14) corresponds exactly to the distribution which is 1 at $\sigma = (1,2)$ and 0 everywhere else. Bob is therefore at Track 1, Alice at Track 2 and Cathy at Track 3.


    σ        ε     (1,2)   (2,3)   (1,3)   (1,2,3)   (1,3,2)
    P(σ)     0       1       0       0        0         0

7  Approximate  inference  by  bandlimiting 

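We now consider the consequences of performing inference using the Fourier transform at a reduced set of coefficients. Important issues include understanding how error can be introduced into the system, and when our algorithms are expected to perform well as an approximation. Specifically, we fix a bandlimit $\lambda^{MIN}$ and maintain the Fourier transform of $P$ only at irreducibles which are at $\lambda^{MIN}$ or above in the dominance ordering:

\[ \Lambda = \{\rho_\lambda : \lambda \trianglerighteq \lambda^{MIN}\}. \]

Algorithm 2: Pseudocode for the Kronecker Conditioning Algorithm.
KroneckerConditioning
    input : Fourier coefficients of the likelihood function, $\hat{L}_{\rho_\lambda}$, $\rho_\lambda \in \Lambda_L$, and
            Fourier coefficients of the prior distribution, $\hat{P}_{\rho_\mu}$, $\rho_\mu \in \Lambda_P$
    output: Fourier coefficients of the posterior distribution, $\widehat{LP}_{\rho_\nu}$, $\rho_\nu \in \Lambda_P$

    foreach $\rho_\nu \in \Lambda_P$ do $\widehat{LP}_{\rho_\nu} \leftarrow 0$   // Initialize Posterior
    // Pointwise Product
    foreach $\rho_\lambda \in \Lambda_L$ do
        foreach $\rho_\mu \in \Lambda_P$ do
            $z \leftarrow$ CGseries($\rho_\lambda$, $\rho_\mu$) ;
            $C_{\lambda\mu} \leftarrow$ CGcoefficients($\rho_\lambda$, $\rho_\mu$) ;  $A_{\lambda\mu} \leftarrow C_{\lambda\mu}^T \cdot (\hat{L}_{\rho_\lambda} \otimes \hat{P}_{\rho_\mu}) \cdot C_{\lambda\mu}$ ;
            for $\rho_\nu \in \Lambda_P$ such that $z_{\lambda\mu\nu} \neq 0$ do
                for $\ell = 1$ to $z_{\lambda\mu\nu}$ do
                    $[\widehat{L^{(t)} P^{(t)}}]_{\rho_\nu} \leftarrow [\widehat{L^{(t)} P^{(t)}}]_{\rho_\nu} + \frac{d_{\rho_\lambda} d_{\rho_\mu}}{d_{\rho_\nu} n!} A_{\lambda\mu}^{(\nu,\ell)}$ ;   // $A_{\lambda\mu}^{(\nu,\ell)}$ is the $(\nu,\ell)$ block of $A_{\lambda\mu}$
    // Normalization
    $\eta \leftarrow [\widehat{L^{(t)} P^{(t)}}]^{-1}_{\rho_{(n)}}$ ;
    foreach $\rho_\nu \in \Lambda$ do $[\widehat{L^{(t)} P^{(t)}}]_{\rho_\nu} \leftarrow \eta \, [\widehat{L^{(t)} P^{(t)}}]_{\rho_\nu}$ ;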


For example, when $\lambda^{MIN} = (n-2,1,1)$, $\Lambda$ is the set $\{\rho_{(n)}, \rho_{(n-1,1)}, \rho_{(n-2,2)}, \rho_{(n-2,1,1)}\}$, which corresponds to maintaining second-order (ordered) marginal probabilities of the form $P(\sigma((i,j)) = (k,\ell))$. During inference, we follow the procedure outlined in the previous section but discard the higher order terms which can be introduced during the conditioning step. Pseudocode for bandlimited prediction/rollup and Kronecker conditioning is given in Algorithms 1 and 2. We note that it is not necessary to maintain the same number of irreducibles for both prior and likelihood during the conditioning step. The first question to ask is: when should one expect a bandlimited approximation to be close to $P(\sigma)$ as a function? Qualitatively, if a distribution is relatively smooth, then most of its energy is stored in the low-order Fourier coefficients. However, in a phenomenon quite reminiscent of the Heisenberg uncertainty principle from quantum mechanics, it is exactly when the distribution is sharply concentrated at a small subset of permutations, that the Fourier projection is unable to faithfully approximate the distribution. We illustrate this uncertainty effect in Figure 5 by plotting the accuracy of a bandlimited distribution against the entropy of a distribution.
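The set of maintained irreducibles is determined entirely by the dominance ordering on partitions, so it can be enumerated directly. The following sketch assumes the usual definition of dominance (every partial sum of $\lambda$ is at least the corresponding partial sum of $\mu$) and represents partitions as tuples of weakly decreasing positive integers; the function names are hypothetical.

```python
def partitions(n, max_part=None):
    """Yield all partitions of n as tuples with weakly decreasing parts."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def dominates(lam, mu):
    """True when lam is at or above mu in the dominance ordering."""
    total_lam = total_mu = 0
    for k in range(max(len(lam), len(mu))):
        total_lam += lam[k] if k < len(lam) else 0
        total_mu += mu[k] if k < len(mu) else 0
        if total_lam < total_mu:
            return False
    return True


def maintained_irreducibles(lam_min):
    """Partitions indexing the set Lambda for a given bandlimit lam_min."""
    n = sum(lam_min)
    return [lam for lam in partitions(n) if dominates(lam, lam_min)]
```

Calling `maintained_irreducibles((3, 1, 1))` lists the partitions of 5 that are kept when the bandlimit is $(n-2,1,1)$ with $n = 5$. Note that dominance is only a partial order, so some pairs of partitions are incomparable and neither dominates the other.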

Even though the bandlimited distribution is sometimes a poor approximation to the true distribution, the marginals maintained by our algorithm are often sufficiently accurate. And so instead of considering the approximation accuracy of the bandlimited Fourier transform to the true joint distribution, we consider the accuracy only at the marginals which are maintained by our method.

[Plot: Bandlimiting Error]

Figure 5: In general, smoother distributions are well approximated by low-order Fourier projections. In this graph, we show the approximation quality of the Fourier projections on distributions with different entropies, starting from sharply peaked delta distributions on the left side of the graph, which get iteratively smoothed until they become the maximum entropy uniform distribution on the right side. On the y-axis, we measure how much energy is preserved in the bandlimited approximation, which we define to be $\frac{|\tilde{P}|^2}{|P|^2}$, where $\tilde{P}$ is the bandlimited approximation to $P$. Each line represents the approximation quality using a fixed number of Fourier coefficients. At one extreme, we achieve perfect signal reconstruction by using all Fourier coefficients, and at the other, we perform poorly on "spiky" distributions, but well on high-entropy distributions, by storing a single Fourier coefficient.

7.1  Sources  of  error  during  inference 

We now analyze the errors incurred during our inference procedures with respect to the accuracy at maintained marginals. It is immediate that the Fourier domain prediction/rollup operation is exact due to its pointwise nature in the Fourier domain. For example, if we have the second order marginals at time $t = 0$, then we can find the exact second order marginals at all $t > 0$ if we only perform prediction/rollup operations. Instead, the errors in inference are only committed by Kronecker conditioning, where they are implicitly introduced at coefficients outside of $\Lambda$ (by effectively setting the coefficients of the prior and likelihood at irreducibles outside of $\Lambda$ to be zero), then propagated inside to the irreducibles of $\Lambda$.

(b) n = 6


Figure 6: We show the dominance ordering for partitions of n = 5 and n = 6 again. By setting $\lambda^{MIN} = (3,1,1)$ and $(4,1,1)$ respectively, we keep the irreducibles corresponding to the partitions in the dotted regions. If we call Kronecker Conditioning with a first-order observation model, then according to Theorem 21, we can expect to incur some error at the Fourier coefficients corresponding to $(3,1,1)$ and $(3,2)$ for n = 5, and $(4,1,1)$ and $(4,2)$ for n = 6 (shown as shaded tableaux), but to be exact at first-order coefficients.


In practice, we observe that the errors introduced at the low-order irreducibles during inference are small if the prior and likelihood are sufficiently diffuse, which makes sense since the high-frequency Fourier coefficients are small in such cases. We can sometimes show that the update is exact at low order irreducibles if we maintain enough coefficients.

Theorem 21. If $\lambda^{MIN} = (n-p, \lambda_2, \ldots)$, and the Kronecker conditioning algorithm is called with a likelihood function whose Fourier coefficients are nonzero only at $\rho_\mu$ when $\mu \trianglerighteq (n-q, \mu_2, \ldots)$, then the approximate Fourier coefficients of the posterior distribution are exact at the set of irreducibles:

\[ \Lambda_{EXACT} = \{\rho_\lambda : \lambda \trianglerighteq (n - |p - q|, \ldots)\}. \]

Proof. See Appendix B. □

For example, if we call Kronecker conditioning by passing in third-order terms of the prior and first-order terms of the likelihood, then all first and second-order (unordered and ordered) marginal probabilities of the posterior distribution can be reconstructed without error.
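The arithmetic behind this example, and behind the caption of Figure 6, can be written out explicitly. The block below is a worked instance under one assumption: that "k-th order terms" correspond to a bandlimit whose first part is $n - k$, so that third-order prior terms give $p = 3$ and first-order likelihood terms give $q = 1$.

```latex
% Third-order prior, first-order likelihood (the example above):
p = 3,\quad q = 1 \;\Longrightarrow\; n - |p - q| = n - 2,
\qquad
\Lambda_{EXACT} = \{\rho_\lambda : \lambda \trianglerighteq (n-2, \ldots)\}
\ni \rho_{(n)},\ \rho_{(n-1,1)},\ \rho_{(n-2,2)},\ \rho_{(n-2,1,1)}.

% Second-order prior, first-order likelihood (the setting of Figure 6):
p = 2,\quad q = 1 \;\Longrightarrow\; n - |p - q| = n - 1,
\qquad
\Lambda_{EXACT} = \{\rho_\lambda : \lambda \trianglerighteq (n-1, \ldots)\}
\ni \rho_{(n)},\ \rho_{(n-1,1)}.
```

In the second case the guarantee covers only the zeroth and first-order coefficients, which matches the statement in Figure 6 that error may appear at the second-order coefficients while the first-order ones remain exact.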

7.2  Projecting  to  the  marginal  polytope 

Despite the encouraging result of Theorem 21, the fact remains that consecutive conditioning steps can propagate errors to all levels of the bandlimited Fourier transform, and in many circumstances, result in a Fourier transform whose "marginal probabilities" correspond to no consistent joint distribution over permutations, and are sometimes negative. To combat this problem, we present a method for projecting to the space of coefficients corresponding to consistent joint distributions (which we will refer to as the marginal polytope) during inference.

We begin by discussing the first-order version of the marginal polytope projection problem. Given an $n \times n$ matrix, $M$, of real numbers, how can we decide whether there exists some probability distribution which has $M$ as its matrix of first-order marginal probabilities? A necessary and sufficient condition, as it turns out, is for $M$ to be doubly stochastic. That is, all entries of $M$ must be nonnegative and all rows and columns of $M$ must sum to one (the probability that Alice is at some track is 1, and the probability that some identity is at Track 3 is 1). The double stochasticity condition comes from the Birkhoff-von Neumann theorem (van Lint & Wilson, 2001) which states that a matrix is doubly stochastic if and only if it can be written as a convex combination of permutation matrices.

To "renormalize" first-order marginals to be doubly stochastic, some authors (Shin et al., 2003; Shin et al., 2005; Balakrishnan et al., 2004; Helmbold & Warmuth, 2007) have used the Sinkhorn iteration, which alternates between normalizing rows and columns independently until convergence is obtained. Convergence is guaranteed under mild conditions and it can be shown that the limit is a nonnegative doubly stochastic matrix which is closest to the original matrix in the sense that the Kullback-Leibler divergence is minimized (Balakrishnan et al., 2004).
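The alternating normalization just described is short enough to sketch. The version below is a minimal illustration that assumes a square matrix with strictly positive entries (the simplest case in which the iteration is well behaved); the function names, iteration cap and tolerances are hypothetical choices, not values from the cited works.

```python
def sinkhorn(matrix, iterations=200, tol=1e-12):
    """Alternately normalize rows and columns of a strictly positive square matrix."""
    m = [[float(x) for x in row] for row in matrix]
    n = len(m)
    for _ in range(iterations):
        # Row step: make every row sum to one.
        for row in m:
            s = sum(row)
            for j in range(n):
                row[j] /= s
        # Column step: make every column sum to one.
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
        # Stop once the rows are still normalized after the column step.
        if all(abs(sum(row) - 1.0) < tol for row in m):
            break
    return m


def is_doubly_stochastic(m, tol=1e-9):
    """Check nonnegativity and unit row and column sums."""
    n = len(m)
    nonneg = all(x >= -tol for row in m for x in row)
    rows = all(abs(sum(row) - 1.0) < tol for row in m)
    cols = all(abs(sum(m[i][j] for i in range(n)) - 1.0) < tol for j in range(n))
    return nonneg and rows and cols
```

A matrix with negative entries fails `is_doubly_stochastic` even when its rows and columns sum to one, and this sketch makes no attempt to handle such input.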

There are several problems which cause the Sinkhorn iteration to be an unnatural solution in our setting. First, since the Sinkhorn iteration only works for nonnegative matrices, we would have to first cap entries to lie in the appropriate range, [0,1]. More seriously, even though the Sinkhorn iteration would guarantee a doubly stochastic higher order matrix of marginals, there are several natural constraints which are violated when running the Sinkhorn iteration on higher-order marginals. For example, with second-order (ordered) marginals, it seems that we should at least enforce the following symmetry constraint:

\[ P(\sigma : \sigma(k,\ell) = (i,j)) = P(\sigma : \sigma(\ell,k) = (j,i)), \]

which says, for example, that the marginal probability that Alice is in Track 1 and Bob is in Track 2 is the same as the marginal probability that Bob is in Track 2 and Alice is in Track 1. Another natural constraint that can be broken is what we refer to as low-order marginal consistency. For example, it should always be the case that:

\[ P(j) = \sum_i P(i,j) = \sum_k P(j,k). \]

It should be noted that the doubly stochastic requirement is a special case of lower-order marginal consistency: we require that higher-order marginals be consistent on the 0th order marginal.

While compactly describing the constraints of the marginal polytope exactly remains an open problem, we propose a method for projecting onto a relaxed form of the marginal polytope which addresses both symmetry and low-order consistency problems by operating directly on irreducible Fourier coefficients instead of on the matrix of marginal probabilities. After each conditioning step, we apply a 'correction' to the approximate posterior $\hat{P}^{(t)}$ by finding the bandlimited function in the relaxed marginal polytope which is closest to $\hat{P}^{(t)}$ in an $L_2$ sense. To perform the projection, we employ the Plancherel Theorem (Diaconis, 1988) which relates the $L_2$ distance between functions on $S_n$ to a distance metric in the Fourier domain.

Proposition  22  (Plancherel  Theorem). 


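\[ \sum_\sigma \left( f(\sigma) - g(\sigma) \right)^2 = \frac{1}{|G|} \sum_\nu d_{\rho_\nu} \mathrm{Tr}\left( \left( \hat{f}_{\rho_\nu} - \hat{g}_{\rho_\nu} \right)^T \cdot \left( \hat{f}_{\rho_\nu} - \hat{g}_{\rho_\nu} \right) \right). \tag{7.1} \]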
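Equation 7.1 can be confirmed numerically on $S_3$. The sketch below is illustrative only and carries the following assumptions: the three irreducibles are built from assumed orthogonal generator matrices (Table 2 is not reproduced here) and extended to the whole group as homomorphisms, and the Fourier transform is $\hat{f}_\rho = \sum_\sigma f(\sigma) \rho(\sigma)$. The trace $\mathrm{Tr}(D^T D)$ is computed as the sum of squared entries of $D$. All names are hypothetical.

```python
import itertools
import math

S3 = list(itertools.permutations(range(3)))
IDENT, S12, S23 = (0, 1, 2), (1, 0, 2), (0, 2, 1)
R = math.sqrt(3.0) / 2.0


def compose(a, b):
    """Return a∘b, i.e. i -> a[b[i]]."""
    return tuple(a[b[i]] for i in range(3))


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def build_rep(gens):
    """Extend generator matrices to all of S3 so that rho(a∘b) = rho(a) rho(b)."""
    d = len(gens[S12])
    rep = {IDENT: [[float(i == j) for j in range(d)] for i in range(d)]}
    stack = [IDENT]
    while stack:
        g = stack.pop()
        for s, m in gens.items():
            h = compose(s, g)
            if h not in rep:
                rep[h] = matmul(m, rep[g])
                stack.append(h)
    return rep


IRREPS = [
    build_rep({S12: [[1.0]], S23: [[1.0]]}),                              # trivial
    build_rep({S12: [[-1.0]], S23: [[-1.0]]}),                            # sign
    build_rep({S12: [[-1.0, 0.0], [0.0, 1.0]], S23: [[0.5, R], [R, -0.5]]}),
]


def fourier(f, rep):
    """Fourier transform of f at rep: sum_sigma f(sigma) rho(sigma)."""
    d = len(rep[IDENT])
    return [[sum(f[s] * rep[s][i][j] for s in S3) for j in range(d)] for i in range(d)]


def l2_distance_direct(f, g):
    """Left side of Equation 7.1."""
    return sum((f[s] - g[s]) ** 2 for s in S3)


def l2_distance_fourier(f, g):
    """Right side of Equation 7.1, summing over all irreducibles of S3."""
    total = 0.0
    for rep in IRREPS:
        d = len(rep[IDENT])
        F, G = fourier(f, rep), fourier(g, rep)
        total += d * sum((F[i][j] - G[i][j]) ** 2 for i in range(d) for j in range(d))
    return total / len(S3)
```

Restricting the loop in `l2_distance_fourier` to a subset of `IRREPS` gives the truncated objective used when only the maintained irreducibles are available, which can only decrease the value since every term is nonnegative.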


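To find the closest bandlimited function in the relaxed marginal polytope, we formulate a quadratic program whose objective is to minimize the right side of Equation 7.1, and whose sum is taken only over the set of maintained irreducibles, $\Lambda$, subject to the set of constraints which require all marginal probabilities to be nonnegative. We thus refer to our correction step as Plancherel Projection. Our quadratic program can be written as:

\begin{align*}
\text{minimize}_{\hat{f}^{proj}} \quad & \sum_{\lambda \in \Lambda} d_\lambda \mathrm{Tr}\left( \left( \hat{f} - \hat{f}^{proj} \right)_{\rho_\lambda}^T \left( \hat{f} - \hat{f}^{proj} \right)_{\rho_\lambda} \right) \\
\text{subject to:} \quad & \left[ \hat{f}^{proj} \right]_{(n)} = 1, \\
& \left[ C_{\lambda^{MIN}} \cdot \left( \bigoplus_{\mu \trianglerighteq \lambda^{MIN}} \bigoplus_{\ell=1}^{K_{\lambda^{MIN}\mu}} \hat{f}^{proj}_{\rho_\mu} \right) \cdot C_{\lambda^{MIN}}^T \right]_{ij} \geq 0, \quad \text{for all } (i,j),
\end{align*}

where $K_{\lambda^{MIN}\mu}$ and $C_{\lambda^{MIN}}$ are the precomputed constants from Equation 5.4.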

We remark that even though the projection will produce a Fourier transform corresponding to nonnegative marginals which are consistent with each other, there might not necessarily exist a joint probability distribution on $S_n$ consistent with those marginals except in the special case of first-order marginals.

Example 23. In Example 20, we ran the Kronecker conditioning algorithm using all of the Fourier coefficients. If only the first-order coefficients are available, however, then the expressions for zeroth and first order terms of the posterior (Equations 6.10, 6.11) become:

\begin{align*}
\left[ \widehat{f \cdot g} \right]_{\rho_{(3)}} &= \frac{1}{3!} \left[ \hat{f}_{\rho_{(3)}} \cdot \hat{g}_{\rho_{(3)}} + 4 \cdot \left[ A_{(2,1)\otimes(2,1)} \right]_{1,1} \right], \tag{7.2} \\
\left[ \widehat{f \cdot g} \right]_{\rho_{(2,1)}} &= \frac{1}{3!} \left[ \hat{f}_{\rho_{(2,1)}} \cdot \hat{g}_{\rho_{(3)}} + \hat{f}_{\rho_{(3)}} \cdot \hat{g}_{\rho_{(2,1)}} + 2 \cdot \left[ A_{(2,1)\otimes(2,1)} \right]_{2:3,2:3} \right]. \tag{7.3}
\end{align*}

Plugging in the same numerical values from Example 20 and normalizing appropriately yields the approximate Fourier coefficients of the posterior:

\[ \left[ \widehat{P(\sigma|z)} \right]_{\rho_{(3)}} = 1, \qquad \left[ \widehat{P(\sigma|z)} \right]_{\rho_{(2,1)}} = \begin{bmatrix} -10/9 & -77/400 \\ 77/400 & 4/3 \end{bmatrix}, \]


which correspond to the following first-order marginal probabilities:

                 A       B       C
    Track 1      0      11/9    -2/9
    Track 2      1       0       0
    Track 3      0      -2/9    11/9

In particular, we see that the approximate matrix of 'marginals' contains negative numbers. Applying the Plancherel projection step, we obtain the following marginals:

                 A       B       C
    Track 1      0       1       0
    Track 2      1       0       0
    Track 3      0       0       1


which  happen  to  be  exactly  the  true  posterior  marginals.  It  should  be  noted 
however,  that  rounding  the  ‘marginals’  to  be  in  the  appropriate  range  would 
have  worked  in  this  particular  example  as  well. 


8  Probabilistic  models  of  mixing  and  observations

While  the  algorithms  presented  in  the  previous  sections  are  general  in  the  sense 
that  they  work  on  all  mixing  and  observation  models,  it  is  not  always  obvious 
how  to  compute  the  Fourier  transform  of  a  given  model.  In  this  section,  we 
present  ways  to  obtain  such  transforms  for  a  few  useful  models. 

8.1  Mixing  models 

The simplest mixing model for identity management assumes that with probability $p$, nothing happens, and that with probability $(1 - p)$, the identities for tracks $i$ and $j$ are swapped. The probability distribution is therefore:

\[ Q_{ij}(\pi) = \begin{cases} p & \text{if } \pi = \epsilon \\ 1 - p & \text{if } \pi = (i,j) \\ 0 & \text{otherwise.} \end{cases} \]

Since $Q_{ij}$ is such a sparse distribution (in the sense that $Q_{ij}(\pi) = 0$ for most $\pi$), it is possible to directly compute $\hat{Q}_{ij}$ using Definition 6:

\[ \left[ \hat{Q}_{ij} \right]_{\rho_\lambda} = p I + (1 - p) \rho_\lambda((i,j)), \]

where $I$ refers to the $d_\lambda \times d_\lambda$ identity matrix, and $\rho_\lambda((i,j))$ is the irreducible representation matrix $\rho_\lambda$ evaluated at the transposition $(i,j)$ (which can be computed using the algorithms from Appendix A).
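The two-term form of $\hat{Q}_{ij}$ and the prediction/rollup update of Algorithm 1 can be tried out without any irreducible representation matrices. The sketch below substitutes the first-order permutation representation (the $n \times n$ permutation matrices) for $\rho_\lambda$; this is an assumption made to keep the example self-contained, and it works because the Fourier transform is a sum over the support of $Q_{ij}$ at any representation. At this representation the transform of a distribution is its matrix of first-order marginals. All function names are hypothetical.

```python
import itertools


def compose(a, b):
    """Return a∘b, i.e. k -> a[b[k]]."""
    return tuple(a[b[k]] for k in range(len(b)))


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def perm_matrix(sigma):
    """Permutation representation: entry (r, c) is 1 when sigma maps c to r."""
    n = len(sigma)
    return [[1.0 if sigma[c] == r else 0.0 for c in range(n)] for r in range(n)]


def fourier_first_order(dist):
    """sum_sigma P(sigma) * perm_matrix(sigma), i.e. the first-order marginals."""
    n = len(next(iter(dist)))
    out = [[0.0] * n for _ in range(n)]
    for sigma, prob in dist.items():
        for c in range(n):
            out[sigma[c]][c] += prob
    return out


def pairwise_mixing_transform(n, i, j, p):
    """Q_hat = p * I + (1 - p) * tau((i, j)) at the permutation representation."""
    swap = list(range(n))
    swap[i], swap[j] = swap[j], swap[i]
    T = perm_matrix(tuple(swap))
    return [[p * (r == c) + (1 - p) * T[r][c] for c in range(n)] for r in range(n)]


def prediction_rollup(q_hat, p_hat):
    """Algorithm 1 at a single representation: a matrix product."""
    return matmul(q_hat, p_hat)


def convolve(q, prior):
    """Brute-force [Q * P](sigma) = sum_tau Q(tau) P(tau^{-1} sigma), for comparison."""
    out = {}
    for tau, qv in q.items():
        for sigma, pv in prior.items():
            key = compose(tau, sigma)
            out[key] = out.get(key, 0.0) + qv * pv
    return out
```

Comparing `prediction_rollup(pairwise_mixing_transform(n, i, j, p), fourier_first_order(prior))` with `fourier_first_order(convolve(q, prior))`, where `q` holds the two nonzero entries of $Q_{ij}$, shows the update acting on marginals alone. Tracks and identities are indexed from 0 here.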


8.2  Observation  models 


The simplest model assumes that we can get observations of the form: 'track $\ell$ is color $k$' (which is essentially the model considered by (Kondor et al., 2007)). The probability of seeing color $k$ at track $\ell$ given data association $\sigma$ is

\[ L(\sigma) = P(z_\ell = k | \sigma) = \alpha_{\sigma(\ell),k}, \tag{8.1} \]

where $\sum_k \alpha_{\sigma(\ell),k} = 1$. For each identity, the likelihood $L(\sigma) = P(z_\ell = k | \sigma)$ depends on a histogram over all possible colors. If the number of possible colors is $K$, then the likelihood model can be specified by an $n \times K$ matrix of probabilities. For example,


    α_{σ(ℓ),k}        k = Red   k = Orange   k = Yellow   k = Green
    σ(ℓ) = Alice        1/2        1/4          1/4          0
    σ(ℓ) = Bob          1/4         0            0          3/4
    σ(ℓ) = Cathy         0         1/2          1/2          0          (8.2)


Since the observation model only depends on a single identity, the first-order
terms of the Fourier transform suffice to fully describe the likelihood.
To compute the first-order Fourier coefficients at irreducibles, we first
compute the first-order Fourier coefficients at the first-order permutation
representation, then transform to irreducible coefficients. The Fourier transform
of the likelihood at the first-order permutation representation is given by:

\[
\left[\hat{L}_{\tau_{(n-1,1)}}\right]_{ij}
= \sum_{\{\sigma : \sigma(j) = i\}} P(z_\ell = k \mid \sigma)
= \sum_{\{\sigma : \sigma(j) = i\}} \alpha_{\sigma(\ell),k}.
\]

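To compute the $ij$-term, there are two cases to consider.

1. If $j = \ell$ (that is, if Track $j$ is the same as the track that was observed),
then the coefficient $\hat{L}_{ij}$ is proportional to the probability that
Identity $i$ is color $k$:

\[
\hat{L}_{ij} = \sum_{\{\sigma : \sigma(\ell) = i\}} \alpha_{i,k} = (n-1)! \cdot \alpha_{i,k}. \tag{8.3}
\]

2. If, on the other hand, $j \neq \ell$ (Track $j$ is not the observed track), then the
coefficient is proportional to the sum of the color-$k$ probabilities over all
identities $m \neq i$:

\begin{align}
\hat{L}_{ij} &= \sum_{\{\sigma : \sigma(j) = i\}} \alpha_{\sigma(\ell),k} \tag{8.4} \\
&= \sum_{m \neq i} \; \sum_{\{\sigma : \sigma(j) = i \text{ and } \sigma(\ell) = m\}} \alpha_{\sigma(\ell),k} \tag{8.5} \\
&= \sum_{m \neq i} (n-2)! \cdot \alpha_{m,k}. \tag{8.6}
\end{align}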
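
The two cases can be checked numerically. The Python sketch below evaluates the closed forms of Equations 8.3 and 8.6 and compares them with a brute-force sum over all permutations. The function names and the 0-indexed observed track `ell` are hypothetical choices for illustration; the input is the "Red" column of Equation 8.2.

```python
import itertools
import math
import numpy as np

def first_order_likelihood(alpha_k, ell):
    """Closed form of Equations 8.3 and 8.6.

    alpha_k[i] is the probability that identity i shows the observed color;
    ell is the (0-indexed) observed track.
    """
    alpha_k = np.asarray(alpha_k, dtype=float)
    n = len(alpha_k)
    L = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if j == ell:  # Equation 8.3
                L[i, j] = math.factorial(n - 1) * alpha_k[i]
            else:         # Equation 8.6: sum over identities m != i
                L[i, j] = math.factorial(n - 2) * (alpha_k.sum() - alpha_k[i])
    return L

def first_order_likelihood_brute(alpha_k, ell):
    """Direct sum of L(sigma) over {sigma : sigma(j) = i} for every (i, j)."""
    n = len(alpha_k)
    L = np.zeros((n, n))
    for sigma in itertools.permutations(range(n)):
        for j in range(n):
            L[sigma[j], j] += alpha_k[sigma[ell]]
    return L

red = np.array([0.5, 0.25, 0.0])  # "Red" column of Equation 8.2
closed = first_order_likelihood(red, ell=0)
brute = first_order_likelihood_brute(red, ell=0)
```

Every column of the resulting matrix sums to $\sum_\sigma L(\sigma) = (n-1)! \sum_i \alpha_{i,k}$, which is 1.5 for this input.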


Example 24. We will compute the first-order marginals of the likelihood function
on $S_3$ which arises from observing a "Red blob at Track 1". Plugging the
values from the "Red" column of the $\alpha$ matrix (Equation 8.2) into Equations 8.3
and 8.6 yields the following matrix of first-order coefficients (at the
$\tau_{(n-1,1)}$ permutation representation):

\[
\left[\hat{L}_{\tau_{(n-1,1)}}\right]_{ij} =
\begin{array}{l|ccc}
 & \text{Track 1} & \text{Track 2} & \text{Track 3} \\
\hline
\text{Alice} & 1/4 & 1/4 & 1 \\
\text{Bob} & 1/2 & 1/2 & 1/2 \\
\text{Cathy} & 3/4 & 3/4 & 0
\end{array}
\]

The corresponding coefficients at the irreducible representations are:

\[
\hat{L}_{(3)} = 1.5, \qquad
\hat{L}_{(2,1)} = \begin{bmatrix} 0 & -\sqrt{3}/4 \\ 0 & -3/4 \end{bmatrix}, \qquad
\hat{L}_{(1,1,1)} = 0.
\]


9  Related  work 

Rankings  and  permutations  have  recently  become  an  active  area  of  research  in 
machine  learning  due  to  their  importance  in  information  retrieval  and  preference 
elicitation.  Rather  than  considering  full  distributions  over  permutations,  many 
approaches, like RankSVM (Joachims, 2002) and RankBoost (Freund et al.,
2003), have instead focused on learning a single ‘optimal’ ranking with respect
to  some  objective  function. 

There are also several authors who have studied distributions over
permutations/rankings (Mallows, 1957; Critchlow, 1985; Fligner & Verducci, 1986;
Taylor et al., 2008; Lebanon & Mao, 2008). (Taylor et al., 2008) consider
distributions over $S_n$ which are induced by the rankings of $n$ independent draws
from $n$ individually centered Gaussian distributions with equal variance. They
compactly summarize their distributions using an $O(n^2)$ matrix which is
conceptually similar to our first-order summaries and apply their techniques to
ranking web documents. Most other previous approaches to directly modeling
distributions on $S_n$, however, have relied on distance based models. For
example, the Mallows model (Mallows, 1957) defines a Gaussian-like distribution
over permutations as:

\[
P(\sigma; c, \sigma_0) \propto \exp(-c\, d(\sigma, \sigma_0)), \tag{9.1}
\]

where the function $d(\sigma, \sigma_0)$ is the Kendall's tau distance, which counts
the number of adjacent swaps that are required to bring $\sigma^{-1}$ to
$\sigma_0^{-1}$. Like Gaussians, distance based models tend to lack flexibility,
and so (Lebanon & Mao, 2008) propose a nonparametric model of ranked (and
partially ranked) data based on placing weighted Mallows kernels on top of
training examples, which, as they show, can realize a far richer class of
distributions, and can be learned efficiently. However, they do not address the
inference problem, and it is not clear if one can efficiently perform inference
operations like marginalization and conditioning in such models.
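
For small $n$, the Mallows model of Equation 9.1 can be evaluated by brute force, as in the following Python sketch. The function names are hypothetical. The sketch treats each permutation as a map from items to ranks and computes the Kendall's tau distance as the number of item pairs that the two rankings order differently, which equals the minimum number of adjacent swaps between the corresponding ordered lists.

```python
import itertools
import math

def kendall_tau(sigma, sigma0):
    """Number of pairs (a, b) that sigma and sigma0 order differently."""
    n = len(sigma)
    return sum(
        1
        for a, b in itertools.combinations(range(n), 2)
        if (sigma[a] - sigma[b]) * (sigma0[a] - sigma0[b]) < 0
    )

def mallows(n, c, sigma0):
    """Normalized Mallows probabilities over all of S_n (brute force)."""
    perms = list(itertools.permutations(range(n)))
    weights = [math.exp(-c * kendall_tau(s, sigma0)) for s in perms]
    z = sum(weights)  # normalization constant
    return {s: w / z for s, w in zip(perms, weights)}

# Hypothetical example: n = 4, spread parameter c = 1, centered at the identity.
P = mallows(n=4, c=1.0, sigma0=(0, 1, 2, 3))
```

The enumeration over all $n!$ permutations is only practical for very small $n$, which is the same scaling obstacle that motivates compact representations.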

As we have shown in this paper, Fourier based methods (Diaconis, 1988;
Kondor et al., 2007; Huang et al., 2007) offer a principled alternative method for
compactly representing distributions over permutations and performing efficient
probabilistic  inference  operations.  Our  work  draws  from  two  strands  of  research 
—  one  from  the  data  association/identity  management  literature,  and  one  from 
a  more  theoretical  area  on  Fourier  analysis  in  statistics.  In  the  following,  we 
review  several  of  the  works  which  have  led  up  to  our  current  Fourier  based 
approach. 

9.1  Previous  work  in  identity  management 

The identity management problem has been addressed in a number of previous
works, and is closely related to, but not identical with, the classical data
association problem of maintaining correspondences between tracks and observations.
Both problems need to address the fundamental combinatorial challenge that
there is a factorial or exponential number of associations to maintain between
tracks and identities, or between tracks and observations respectively. A vast
literature already exists on the data association problem, beginning with
the multiple hypothesis testing approach (MHT) of (Reid, 1979). The MHT is a
‘deferred logic’ method in which past observations are exploited in forming new
hypotheses when a new set of observations arises. Since the number of hypotheses
can grow exponentially over time, various heuristics have been proposed to
help cope with the complexity blowup. For example, one can choose to maintain
only the $k$ best hypotheses for some parameter $k$ (Cox & Hingorani, 1994), using
Murty’s algorithm (Murty, 1968). But for such an approximation to be effective,
$k$ may still need to scale exponentially in the number of objects. A slightly
more recent filtering approach is the joint probabilistic data association filter
(JPDA) (Bar-Shalom & Fortmann, 1988), which is a suboptimal single-stage
approximation of the optimal Bayesian filter. JPDA makes associations sequentially
and is unable to correct erroneous associations made in the past (Poore,
1995). Even though the JPDA is more efficient than the MHT, the calculation
of the JPDA association probabilities is still a #P-complete problem (Collins
& Uhlmann, 1992), since it effectively must compute matrix permanents.
Polynomial approximation algorithms to the JPDA association probabilities have
recently been studied using Markov chain Monte Carlo (MCMC) methods (Oh
et al., 2004; Oh & Sastry, 2005).

The identity management problem was first explicitly introduced in (Shin
et al., 2003). Identity management differs from the classical data association
problem in that its observation model is not concerned with the low-level tracking
details but instead with high level information about object identities. (Shin
et al., 2003) introduced the notion of the belief matrix approximation of the
association probabilities, which collapses a distribution over all possible
associations to just its first-order marginals. In the case of $n$ tracks and $n$
identities, the belief matrix $B$ is an $n \times n$ doubly-stochastic matrix of
non-negative entries $b_{ij}$, where $b_{ij}$ is the probability that identity $i$
is associated with track $j$. As we already saw in Section 4, the belief matrix
approximation is equivalent to maintaining the zeroth- and first-order Fourier
coefficients. Thus our current work is a strict generalization and extension of
those previous results.

An alternative representation that has also been considered is an information
theoretic approach (Shin et al., 2005; Schumitsch et al., 2005; Schumitsch et al.,
2006) in which the density is parameterized as:

\[
P(\sigma; \Omega) \propto \exp \operatorname{Tr}\left( \Omega^T \cdot \tau_{(n-1,1)}(\sigma) \right).
\]

In our framework, the information form approach can be viewed as a method for
maintaining the Fourier transform of the log probability distribution at only the
first two irreducibles. The information matrix approach is especially attractive
in a distributed sensor network setting, since, if the columns of the information
matrix are distributed to leader nodes tracking the respective targets, then the
observation events become entirely local operations, avoiding the more expensive
Kronecker conditioning algorithm in our setting. On the other hand, the
information matrix coefficients do not have the same intuitive marginals
interpretation afforded in our setting, and moreover, prediction/rollup steps cannot
be performed analytically in the information matrix form. As in many classical
data structures problems there are representation trade-off issues: some
operations are less expensive in one representation and some operations in the
other. The best choice in any particular scenario will depend on the ratio
between observation and mixing events.
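
The locality of observation events in the information form can be illustrated with a small Python sketch. It assumes the convention $[\tau_{(n-1,1)}(\sigma)]_{ab} = 1$ when $\sigma(b) = a$, so that $\operatorname{Tr}(\Omega^T \tau_{(n-1,1)}(\sigma)) = \sum_j \Omega_{\sigma(j),j}$, and it assumes strictly positive likelihood values so that their logarithms are finite. Under these assumptions, conditioning on an observation at track $\ell$ only adds a vector to column $\ell$ of $\Omega$. The function names and the numerical values are hypothetical.

```python
import itertools
import numpy as np

def info_form_distribution(omega):
    """P(sigma) proportional to exp(sum_j omega[sigma[j], j]), by enumeration."""
    n = omega.shape[0]
    perms = list(itertools.permutations(range(n)))
    w = np.array([np.exp(sum(omega[s[j], j] for j in range(n))) for s in perms])
    return perms, w / w.sum()

def condition_info_form(omega, ell, alpha_k):
    """Observation at track ell: add log-likelihoods to column ell only."""
    updated = omega.copy()
    updated[:, ell] += np.log(alpha_k)
    return updated

rng = np.random.default_rng(0)
omega = rng.normal(size=(3, 3))      # hypothetical information matrix
alpha_k = np.array([0.5, 0.3, 0.2])  # hypothetical, strictly positive

perms, prior = info_form_distribution(omega)
_, posterior_info = info_form_distribution(condition_info_form(omega, 0, alpha_k))

# Reference: Bayes rule applied permutation by permutation.
lik = np.array([alpha_k[s[0]] for s in perms])
posterior_bayes = prior * lik / (prior * lik).sum()
```

The two posteriors agree, which is the sense in which the observation update is a local operation on a single column.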

9.2  Previous  work  on  Fourier-based  approximations 

The  concept  of  using  Fourier  transforms  to  study  probability  distributions  on 
groups  is  not  new,  with  the  earliest  papers  in  this  area  having  been  published 
in  the  1960s  (Grenander,  1963).  (Willsky,  1978)  was  the  first  to  formulate  the 
exact  filtering  problem  in  the  Fourier  domain  for  finite  and  locally  compact 
Lie  groups  and  contributed  the  first  noncommutative  Fast  Fourier  Transform 
algorithm  (for  Metacyclic  groups).  However,  he  does  not  address  approximate 
inference,  suggesting  instead  to  always  transform  to  the  appropriate  domain 
for which either the prediction/rollup or conditioning operations can be
accomplished using a pointwise product. While providing significant improvements in
complexity  for  smaller  groups,  his  approach  is  still  infeasible  for  our  problem 
given  the  factorial  order  of  the  Symmetric  group. 

(Diaconis, 1988) utilized the Fourier transform to analyze probability
distributions on the Symmetric group in order to study card shuffling and ranking
problems. His work laid the groundwork for much of the progress made over the last
two decades on probabilistic group theory and noncommutative FFT algorithms
(Clausen & Baum, 1993; Rockmore, 2000).

(Kondor et al., 2007) was the first to show that the data association problem
could be efficiently approximated using FFT factorizations. In contrast to
our framework where every model is assumed to have been specified in the
Fourier domain, they work with an observation model which can be written in
the primal domain.

Conceptually,  their  conditioning  algorithm  applies  the  Inverse  Fast  Fourier 
Transform  (IFFT)  to  the  prior  distribution,  conditions  in  the  primal  domain 
using  pointwise  multiplication,  then  transforms  back  up  to  the  Fourier  domain 
using  the  FFT  to  obtain  posterior  Fourier  coefficients.  While  their  procedure 
would  ordinarily  be  intractable  because  of  the  factorial  number  of  permutations, 
they  show  that  for  simple  observation  models,  such  as  that  given  in  Equation  8.1, 
it is not necessary to perform the full FFT recursion to do a pointwise
product. They exploit this observation to formulate a conditioning algorithm whose
running time depends on the complexity of the observation model (which can
roughly be measured by the number of irreducibles required to fully specify it).
In the worst case, when the likelihood function is specified for each
$\sigma \in S_n$, the cost of conditioning is dominated by the cost of calling an
FFT, which is $O(n! \log n!)$.

In the case that the observation model is specified at sufficiently many
irreducibles, our conditioning algorithm (prior to the projection step) returns the
same approximate probabilities as the FFT-based algorithm. For example, we
can show that the observation model given in Equation 8.1 is fully specified by
two Fourier components, and that both algorithms have identical output. In
this setting, our asymptotic time complexity is $O(D^3 n^2)$, where $D$ is the degree
of the largest maintained irreducible representation. The FFT-based algorithm
saves a factor of $D$ due to the fact that certain representation matrices can be
shown to be sparse. Though we do not prove it, we observe that the
Clebsch-Gordan coefficients $C_{ij}$ are typically similarly sparse (see Figure 7(d)),
which yields an equivalent running time in practice. In addition, Kondor et al. do
not address the issue of projecting onto legal distributions, which, as we show in
our experimental results, is fundamental in practice.

10  Experimental  results 

In this section we present the results of several experiments to validate our
algorithm. We evaluate performance first by measuring the quality of our
approximation for problems where the true distribution is known. Instead of measuring
a distance between the true distribution and the inverse Fourier transform of
our approximation, it makes more sense in our setting to measure error only
at the marginals which are maintained by our approximation. In the results
reported below, we measure the $L_1$ error between the true matrix of marginals
and the approximation. If nonnegative marginal probabilities are guaranteed,
it also makes sense to measure KL-divergence.

Figure 7: (a) Kronecker Conditioning Accuracy: we measure the accuracy of a
single Kronecker conditioning operation after some number of mixing events.
(b) HMM Accuracy: we measure the average accuracy of posterior marginals over
250 timesteps, varying the proportion of mixing and observation events.
(c) Running times: we compared running times of our polynomial time bandlimited
inference algorithms against an exact algorithm with $O(n^3 n!)$ time complexity.
(d) Clebsch-Gordan Sparsity: we measured the sparsity of the Clebsch-Gordan
coefficients matrices by plotting the reciprocal of the fraction of nonzero
entries against $n$. A straight line in the plot means that the number of nonzero
entries scales linearly in $n$, and a convex curve scales better than linearly.
(Plot titles and axis labels: ‘KL(true||bandlimited approximation)’; ‘Projection
versus No Projection (n=6)’; ‘Running time of 10 forward algorithm iterations’;
‘Sparsity of Clebsch-Gordan coefficients’.)



10.1  Simulated  data 

We first tested the accuracy of a single Kronecker conditioning step by calling
some number of pairwise mixing events (which can be thought of roughly as a
measure of entropy), followed by a single first-order observation. On the y-axis
of Figure 7(a), we plot the Kullback-Leibler divergence between the true
first-order marginals and the approximate first-order marginals returned by Kronecker
conditioning. We compared the results of maintaining first-order and
second-order (unordered and ordered) marginals. As shown in Figure 7(a), Kronecker
conditioning is more accurate when the prior is smooth and, unsurprisingly,
when we allow for higher order Fourier terms. As guaranteed by Theorem 21,
we also see that the first-order terms of the posterior are exact when we maintain
second-order (ordered) marginals.




To  understand  how  our  algorithms  perform  over  many  timesteps  (where 
errors  can  propagate  to  all  Fourier  terms),  we  compared  to  exact  inference 
on  synthetic  datasets  in  which  tracks  are  drawn  at  random  to  be  observed  or 
swapped.  As  a  baseline,  we  show  the  accuracy  of  a  uniform  distribution.  We 
observe  that  the  Fourier  approximation  is  better  when  there  are  either  more 
mixing  events  (the  fraction  of  conditioning  events  is  smaller),  or  when  more 
Fourier  coefficients  are  maintained,  as  shown  in  Figure  7(b).  We  also  see  that 
the  Plancherel  Projection  step  is  fundamental,  especially  when  mixing  events 
are  rare. 

Figures  8(a)  and  8(b)  show  the  per-timeslice  accuracy  of  two  typical  runs  of 
the  algorithm.  The  fraction  of  conditioning  events  is  50%  in  Figure  8(a),  and 
70%  in  Figure  8(b).  What  we  typically  observe  is  that  while  the  projected  and 
nonprojected  accuracies  are  often  quite  similar,  the  nonprojected  marginals  can 
perform  significantly  worse  during  certain  segments. 

Finally, we compared running times against an exact inference algorithm
which performs prediction/rollup in the Fourier domain and conditioning in the
primal domain. Instead of the naive $O((n!)^2)$ complexity, its running time is
a more efficient $O(n^3 n!)$ due to the Fast Fourier Transform (Clausen & Baum,
1993). It is clear that our algorithm scales gracefully compared to the exact
solution (Figure 7(c)), and in fact, we could not run exact inference for $n > 8$ due
to memory constraints. In Figure 7(d), we show empirically that the
Clebsch-Gordan coefficients are indeed sparse, supporting our conjectured runtime of
$O(D^2 n^2)$ instead of $O(D^3 n^2)$.

10.2  Real  camera  network 

We  also  evaluated  our  algorithm  on  data  taken  from  a  real  network  of  eight 
cameras  (Fig.  9(a)).  In  the  data,  there  are  n  =  11  people  walking  around  a 
room  in  fairly  close  proximity.  To  handle  the  fact  that  people  can  freely  leave 
and  enter  the  room,  we  maintain  a  list  of  the  tracks  which  are  external  to  the 
room.  Each  time  a  new  track  leaves  the  room,  it  is  added  to  the  list  and  a 
mixing event is called to allow for $m^2$ pairwise swaps amongst the $m$ external
tracks. 

The number of mixing events is approximately the same as the number of
observations. For each observation, the network returns a color histogram of
the blob associated with one track. The task after conditioning on each
observation is to predict identities for all tracks which are inside the room,
and the evaluation metric is the fraction of accurate predictions. We compared
against a baseline approach of predicting the identity of a track based on the
most recently observed histogram at that track. This approach is expected to
be accurate when there are many observations and discriminative appearance
models, neither of which our problem afforded. As Figure 9(b) shows, both
the baseline and first order model (without projection) fared poorly, while the
projection step dramatically boosted the prediction accuracy for this problem.
To illustrate the difficulty of predicting based on appearance alone, the rightmost
bar reflects the performance of an omniscient tracker who knows the result of
each mixing event and is therefore left only with the task of distinguishing
between appearances. We conjecture that the performance of our algorithm
(with projection) is near optimal.

Figure 8: Accuracy as a function of time on two typical runs (per timeslice
accuracy). (b) n = 6 with 30% mixing events and 70% observations.

11  Future  research 

There  remain  several  possible  extensions  to  the  current  work  stemming  from 
both  practical  and  theoretical  considerations.  We  list  a  few  open  questions  and 
extensions  in  the  following. 

Adaptive  filtering.  While  our  current  algorithms  easily  beat  exact  inference 
in  terms  of  running  time,  they  are  still  limited  by  a  relatively  high  (though 
polynomial)  time  complexity.  In  practice  however,  it  seems  reasonable  to  believe 
that  the  “difficult”  identity  management  problems  typically  involve  only  a  small 
subset  of  people  at  a  time.  A  useful  extension  of  our  work  would  be  to  devise 
an  adaptive  version  of  the  algorithm  which  allocates  more  Fourier  coefficients 
towards  the  identities  which  require  higher  order  reasoning.  We  believe  that 
this  kind  of  extension  would  be  the  appropriate  way  to  scale  our  algorithm  to 
handling massive numbers of objects at a time.

Figure 9: Evaluation on dataset from a real camera network. (a) Sample image.
(b) Accuracy for camera data.



Characterizing the marginal polytope. In our paper, we presented a
projection of the bandlimited distribution to a certain polytope, which is exactly
the  marginal  polytope  for  first-order  bandlimited  distributions,  but  strictly  an 
outer  bound  for  higher  orders.  An  interesting  project  would  be  to  generalize  the 
Birkhoff-von  Neumann  theorem  by  exactly  characterizing  the  marginal  polytope 
at  higher  order  marginals.  We  conjecture  that  the  marginal  polytope  for  low 
order  marginals  can  be  described  with  polynomially  many  constraints. 

Learning in the Fourier domain. Another interesting problem is whether
we can learn bandlimited mixing and observation models directly in the Fourier
domain. Given fully observed permutations $\sigma_1, \ldots, \sigma_m$, drawn from a
distribution $P(\sigma)$, a naive method for estimating $\hat{P}_\rho$ at low-order
$\rho$ is to simply observe that:

\[
\hat{P}_\rho = E_{\sigma \sim P}[\rho(\sigma)],
\]

and so one can estimate the Fourier transform by simply averaging $\rho(\sigma_i)$
over all $\sigma_i$. However, since we typically do not observe full permutations in
real applications like ranking or identity management, it would be interesting to
estimate Fourier transforms using partially observed data. In the case of Bayesian
learning, it may be possible to apply some of the techniques discussed in this
paper.
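
The naive estimator can be sketched in Python at the first-order permutation representation, where $\rho(\sigma)$ is simply a permutation matrix and the estimate is a matrix of empirical first-order marginals. The function names, the convention that entry $(a, b)$ equals 1 when $\sigma(b) = a$, and the uniformly drawn samples are assumptions made for illustration.

```python
import numpy as np

def perm_matrix(sigma):
    """Permutation matrix with entry (a, b) = 1 when sigma[b] == a."""
    n = len(sigma)
    m = np.zeros((n, n))
    for b, a in enumerate(sigma):
        m[a, b] = 1.0
    return m

def estimate_first_order(samples):
    """Average of tau(sigma_i) over fully observed permutations sigma_i."""
    return np.mean([perm_matrix(s) for s in samples], axis=0)

rng = np.random.default_rng(0)
# Hypothetical data: 2000 fully observed permutations drawn uniformly from S_4.
samples = [tuple(rng.permutation(4)) for _ in range(2000)]
P_hat = estimate_first_order(samples)
```

Since every sample contributes a permutation matrix, the estimate is doubly stochastic by construction, and for uniform samples its entries concentrate around $1/n$.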

Probabilistic inference on other groups. The Fourier theoretic framework
presented in this paper is not specific to the Symmetric group. In fact, the
prediction/rollup and conditioning formulations, as well as most of the results from
Appendix B, hold over any finite or compact Lie group. As an example, the
noncommutative group of rotation operators in three dimensions, SO(3), appears
in settings which model the pose of a three dimensional object. Elements in
SO(3) might be used to represent the pose of a robot arm in robotics, or the
orientation of a mesh in computer graphics. In many settings, it would be
useful to have a compact representation of uncertainty over poses. We believe that
there are many other application domains with algebraic structure where similar
probabilistic inference algorithms might apply, and in particular, that
noncommutative settings offer a particularly challenging but exciting opportunity for
machine learning research.

12  Conclusions 

We have presented an intuitive method for compactly summarizing distributions
on permutations with Fourier analytic interpretations and tuneable
approximation quality. We showed that the Fourier theoretic point of view makes it
possible to formulate general inference operations completely in the Fourier
domain. In particular, we developed the Kronecker Conditioning algorithm which
performs  a  convolution-like  operation  on  Fourier  coefficients  to  find  the  Fourier 
transform  of  the  posterior  distribution.  We  analyzed  the  sources  of  error  in  our 
approximations  and  argued  that  bandlimited  conditioning  can  result  in  Fourier 
coefficients  which  correspond  to  no  valid  distribution,  but  that  the  problem  can 
be  remedied  by  projecting  to  a  relaxation  of  the  marginal  polytope. 

Our  evaluation  on  data  from  a  camera  network  shows  that  our  methods 
perform well when compared to the optimal solution in small problems, or to
an  omniscient  tracker  in  large  problems.  Furthermore,  we  demonstrated  that 
our  projection  step  is  fundamental  to  obtaining  these  high-quality  results. 

Finally  we  conclude  by  remarking  again  that  the  mathematical  framework 
developed  in  our  paper  is  quite  general.  In  fact,  both  the  prediction/rollup 
and  conditioning  formulations  hold  over  any  finite  group,  providing  a  principled 
method for approximate inference for problems with underlying group structure.

A Constructing irreducible representation matrices

In this section, we present (without proof) some standard algorithms for
constructing the irreducible representation matrices with respect to the
Gel’fand-Tsetlin (GZ) basis (for a more elaborate discussion, see, for example, (Kondor,
2006;  Chen,  1989;  Vershik  &  Okounkov,  2006)).  There  are  several  properties 
which  make  the  irreducible  representation  matrices,  written  with  respect  to  the 
GZ  basis,  fairly  useful  in  practice.  They  are  guaranteed  to  be,  for  example, 
real-valued  and  orthogonal.  And  as  we  will  show,  the  matrices  have  certain 
useful  sparsity  properties  that  can  be  exploited  in  implementation. 

We  begin  by  introducing  a  few  concepts  relating  to  Young  tableaux  which  are 
like  Young  tabloids  with  the  distinction  that  the  rows  are  considered  as  ordered 
tuples  rather  than  unordered  sets.  For  example,  the  following  two  diagrams  are 
distinct as Young tableaux, but not as Young tabloids:
(as Young tableaux).

A Young tableau $t$ is said to be standard if its entries are increasing to the
right along rows and down columns. For example, the set of all standard Young
tableaux of shape $\lambda = (3,2)$ is (listing each tableau row by row, with rows
separated by a slash):

\[
\{\; 1\,2\,3 \,/\, 4\,5, \quad 1\,2\,4 \,/\, 3\,5, \quad 1\,3\,4 \,/\, 2\,5, \quad 1\,2\,5 \,/\, 3\,4, \quad 1\,3\,5 \,/\, 2\,4 \;\}. \tag{A.1}
\]


Given a permutation $\sigma \in S_n$, one can always apply $\sigma$ to a Young
tableau $t$ to get a new Young tableau, which we denote by $\sigma \circ t$, by
permuting the labels within the tableau. For example, applying $(1,2)$ to a tableau
exchanges the positions of the labels 1 and 2.

Note, however, that even if $t$ is a standard tableau, $\sigma \circ t$ is not
guaranteed to be standard.

The significance of the standard tableaux is that the set of all standard
tableaux of shape $\lambda$ can be used to index the set of GZ basis vectors for the
irreducible representation $\rho_\lambda$. Since there are five total standard
tableaux of shape $(3,2)$, we see, for example, that the irreducible corresponding
to the partition $(3,2)$ is 5-dimensional. There is a simple recursive procedure for
enumerating the set of all standard tableaux of shape $\lambda$, which we illustrate
for $\lambda = (3,2)$.


Example 25. If $\lambda = (3,2)$, there are only two possible boxes that the label 5
can occupy so that both rows and columns are increasing: the last box of the first
row, or the last box of the second row.

To enumerate the set of all standard tableaux of shape $(3,2)$, we need to fill the
empty boxes in these partially filled tableaux with the labels $\{1,2,3,4\}$ so
that both rows and columns are increasing. Enumerating the standard tableaux
of shape $(3,2)$ thus reduces to enumerating the set of standard tableaux of shapes
$(2,2)$ and $(3,1)$, respectively. Writing each tableau row by row with rows
separated by a slash, the set of standard tableaux for $(2,2)$ (which, in
implementation, would be computed recursively) is:

\[
1\,2 \,/\, 3\,4, \qquad 1\,3 \,/\, 2\,4,
\]

and for $(3,1)$, the set of standard tableaux is:

\[
1\,2\,3 \,/\, 4, \qquad 1\,2\,4 \,/\, 3, \qquad 1\,3\,4 \,/\, 2.
\]

The entire set of standard tableaux of shape $(3,2)$ is therefore:

\[
1\,2\,5 \,/\, 3\,4, \quad 1\,3\,5 \,/\, 2\,4, \quad 1\,2\,3 \,/\, 4\,5, \quad 1\,2\,4 \,/\, 3\,5, \quad 1\,3\,4 \,/\, 2\,5.
\]
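
The recursive enumeration in Example 25 can be written as a short Python sketch. The function name and the data layout (a tableau as a tuple of rows, each row a tuple of labels) are hypothetical choices; the recursion places the largest label in each removable corner box and enumerates the standard tableaux of the remaining shape.

```python
def standard_tableaux(shape):
    """Return all standard Young tableaux of the given shape.

    A tableau is a tuple of rows, each row a tuple of labels. The largest
    label n must sit in a removable corner box; removing that box leaves a
    smaller shape whose standard tableaux are enumerated recursively.
    """
    shape = tuple(r for r in shape if r > 0)
    if not shape:
        return [()]
    n = sum(shape)
    result = []
    for r, length in enumerate(shape):
        # Row r ends in a corner box if the row below is strictly shorter.
        below = shape[r + 1] if r + 1 < len(shape) else 0
        if length > below:
            smaller = shape[:r] + (length - 1,) + shape[r + 1:]
            for t in standard_tableaux(smaller):
                # Pad with empty rows in case the last row was removed.
                rows = list(t) + [()] * (len(shape) - len(t))
                rows[r] = rows[r] + (n,)
                result.append(tuple(rows))
    return result

tableaux_32 = standard_tableaux((3, 2))
```

The number of tableaux returned for a shape $\lambda$ is the dimension $d_\lambda$ of the corresponding irreducible representation.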
Before explicitly constructing the representation matrices, we must define a
signed distance on Young tableaux called the axial distance.

Definition 26. The axial distance, $d_t(i,j)$, between entries $i$ and $j$ in tableau
$t$, is defined to be:

\[
d_t(i,j) = (\mathrm{col}(t,j) - \mathrm{col}(t,i)) - (\mathrm{row}(t,j) - \mathrm{row}(t,i)),
\]

where $\mathrm{row}(t,i)$ denotes the row of label $i$ in tableau $t$, and
$\mathrm{col}(t,i)$ denotes the column of label $i$ in tableau $t$.


Intuitively, the axial distance between $i-1$ and $i$ in a standard tableau $t$ is
equal to the (signed) number of steps that are required to travel from $i-1$ to
$i$, if at each step, one is allowed to traverse a single box in the tableau in one of
the four cardinal directions. For example, the axial distance from 3 to 4 with
respect to a tableau $t$ in which 3 occupies row 1, column 3 and 4 occupies
row 2, column 1 is:

\[
d_t(3,4) = (\mathrm{col}(t,4) - \mathrm{col}(t,3)) - (\mathrm{row}(t,4) - \mathrm{row}(t,3))
= (1 - 3) - (2 - 1) = -3.
\]


A.1 Constructing representation matrices for adjacent transpositions

In the following discussion, we will consider a fixed ordering, $t_1, \ldots, t_{d_\lambda}$,
on the set of standard tableaux of shape $\lambda$ and refer to both standard tableaux
and columns of $\rho_\lambda(\sigma)$ interchangeably. Thus $t_1$ refers to the first
column, $t_2$ refers to the second column, and so on. We will index elements in
$\rho_\lambda(\sigma)$ using pairs of standard tableaux, $(t_j, t_k)$.

To explicitly define the representation matrices with respect to the GZ basis,
we will first construct the matrices for adjacent transpositions (i.e.,
permutations of the form $(i-1,i)$), and then we will construct arbitrary representation
matrices by combining the matrices for the adjacent transpositions. The rule
for constructing the matrix coefficient $[\rho_\lambda(i-1,i)]_{t_j,t_k}$ is as follows.

1.  Define  the  ( tj,tk )  coefficient  of  p\{i—  1,  i)  to  be  zero  if  it  is  (1),  off-diagonal 
( j  A  k )  and  (2),  not  of  the  form  ( tj ,  ( i  —  1,  i)  o  tk)- 

2.  If  ( tj,tk )  is  a  diagonal  element,  (i.e.,  of  the  form  (tj,tj)),  define: 

[p\{i  -  =  l/dtj{i  —  1, *), 

where  dtAi  —  1  ,i)  is  the  axial  distance  which  we  defined  earlier  in  the 
section. 

3.  If  ( tj,tk )  can  be  written  as  ( tj ,  (*  —  l,i)  o  tj)  define: 

\px{i  -  1,  i)\tjtCrotj  =  ijl-  l/dt.(i  -  M). 

Algorithm 3: Pseudocode for computing irreducible representation matrices with respect to the Gel'fand-Tsetlin basis at adjacent transpositions.

ADJACENTRHO
input : $i \in \{2, \dots, n\}$, $\lambda$
output: $\rho_\lambda(i-1, i)$
    $\rho \leftarrow 0_{d_\lambda \times d_\lambda}$;
    foreach standard tableau t of shape $\lambda$ do
        $d \leftarrow (\mathrm{col}(t,i) - \mathrm{col}(t,i-1)) - (\mathrm{row}(t,i) - \mathrm{row}(t,i-1))$;
        $\rho(t,t) \leftarrow 1/d$;
        if i - 1 and i are in different rows and columns of t then
            $\rho((i-1,i)(t),\, t) \leftarrow \sqrt{1 - 1/d^2}$;
    return $\rho$;

Note that the only time that off-diagonal elements can be nonzero under the above rules is when $(i-1,i)\circ t_j$ happens to also be a standard tableau. If we apply an adjacent transposition, $\sigma = (i-1,i)$, to a standard tableau t, then $\sigma\circ t$ is guaranteed to be standard if and only if i - 1 and i were neither in the same row nor column of t. This can be seen by examining each case separately.


1. i - 1 and i are in the same row or same column of t. If i and i - 1 are in the same row of t, then i - 1 lies to the left of i. Applying $\sigma$ to t swaps their positions so that i lies to the left of i - 1, and so we see that $\sigma\circ t$ cannot be standard. For example,

    (3,4) o  1 3 4   =   1 4 3
             2 5         2 5

Similarly, we see that if i and i - 1 are in the same column of t, $\sigma\circ t$ cannot be standard. For example,

    (3,4) o  1 3 5   =   1 4 5
             2 4         2 3

2. i - 1 and i are neither in the same row nor column of t. In the second case, $\sigma\circ t$ can be seen to be a standard tableau due to the fact that i - 1 and i are adjacent indices. For example,

    (3,4) o  1 2 4   =   1 2 3
             3 5         4 5

Therefore, to see if $(i-1,i)\circ t$ is standard, we need only check that i - 1 and i are in different rows and columns of the tableau t. The pseudocode for constructing the irreducible representation matrices for adjacent swaps is summarized in Algorithm 3. Note that the matrices constructed in the algorithm are sparse, with no more than two nonzero elements in any given column.
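Algorithm 3 can be sketched in Python as follows. This is a minimal dense-matrix illustration, not the authors' implementation: the names `standard_tableaux`, `locate` and `adjacent_rho` are hypothetical, tableaux are stored as tuples of rows, and the tableaux are enumerated in whatever order the recursion produces, which need not match the ordering of Equation A.1. For that reason the function also returns the list of tableaux so that entries can be looked up by tableau.

```python
import math


def standard_tableaux(shape):
    """Enumerate all standard tableaux of `shape` (a tuple of row lengths)."""
    n = sum(shape)
    results = []

    def fill(rows, label):
        if label > n:
            results.append(tuple(tuple(r) for r in rows))
            return
        for r in range(len(shape)):
            # The next label may go at the end of row r if the row has space
            # and the box directly above the new box is already filled.
            has_space = len(rows[r]) < shape[r]
            supported = r == 0 or len(rows[r - 1]) > len(rows[r])
            if has_space and supported:
                rows[r].append(label)
                fill(rows, label + 1)
                rows[r].pop()

    fill([[] for _ in shape], 1)
    return results


def locate(tableau, label):
    """Return the 0-based (row, column) of `label`."""
    for r, row in enumerate(tableau):
        if label in row:
            return r, row.index(label)
    raise ValueError(label)


def adjacent_rho(i, shape):
    """Return (tableaux, matrix) for rho_shape at the transposition (i-1, i)."""
    tableaux = standard_tableaux(shape)
    index = {t: k for k, t in enumerate(tableaux)}
    rho = [[0.0] * len(tableaux) for _ in tableaux]
    for t in tableaux:
        (r_prev, c_prev), (r_cur, c_cur) = locate(t, i - 1), locate(t, i)
        d = (c_cur - c_prev) - (r_cur - r_prev)  # axial distance d_t(i-1, i)
        rho[index[t]][index[t]] = 1.0 / d
        if r_prev != r_cur and c_prev != c_cur:
            # Swapping the labels i-1 and i gives another standard tableau.
            swap = {i - 1: i, i: i - 1}
            swapped = tuple(tuple(swap.get(x, x) for x in row) for row in t)
            rho[index[swapped]][index[t]] = math.sqrt(1.0 - 1.0 / d**2)
    return tableaux, rho
```

Calling `adjacent_rho(4, (3, 2))` gives a 5 x 5 matrix whose square is the identity, as expected for the image of a transposition.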


Example 27. We compute the representation matrix of $\rho_{(3,2)}$ evaluated at the adjacent transposition $\sigma = (i-1,i) = (3,4)$. For this example, we will use the enumeration of the standard tableaux of shape (3,2) given in Equation A.1.

For each (3,2)-tableau $t_j$, we identify whether $\sigma\circ t_j$ is standard and compute the axial distance from 3 to 4 on the tableau $t_j$.

    j                            1        2        3        4        5

    t_j                        1 3 5    1 2 5    1 3 4    1 2 4    1 2 3
                               2 4      3 4      2 5      3 5      4 5

    (3,4) o t_j                1 4 5    1 2 5    1 4 3    1 2 3    1 2 4
                               2 3      4 3      2 5      4 5      3 5

    (3,4) o t_j standard?      No       No       No       Yes      Yes

    axial distance d_tj(3,4)   -1       1        1        3        -3

Putting the results together in a matrix yields:

$$\rho_{(3,2)}(3,4) =
\begin{array}{c|ccccc}
    & t_1 & t_2 & t_3 & t_4 & t_5 \\ \hline
t_1 & -1  &     &     &     &     \\
t_2 &     & 1   &     &     &     \\
t_3 &     &     & 1   &     &     \\
t_4 &     &     &     & \frac{1}{3} & \sqrt{\frac{8}{9}} \\
t_5 &     &     &     & \sqrt{\frac{8}{9}} & -\frac{1}{3}
\end{array}$$

where all of the empty entries are zero.


A.2 Constructing representation matrices for general permutations

To construct representation matrices for general permutations, it is enough to observe that all permutations can be factored into a sequence of adjacent swaps. For example, the permutation (1,2,5) can be factored into:

$$(1,2,5) = (4,5)(3,4)(1,2)(2,3)(3,4)(4,5),$$

where the product is composed from right to left, and hence, for any partition $\lambda$,

$$\rho_\lambda(1,2,5) = \rho_\lambda(4,5)\cdot\rho_\lambda(3,4)\cdot\rho_\lambda(1,2)\cdot\rho_\lambda(2,3)\cdot\rho_\lambda(3,4)\cdot\rho_\lambda(4,5),$$

since $\rho_\lambda$ is a group representation. Algorithmically, factoring a permutation into adjacent swaps looks very similar to the Bubblesort algorithm, and we show the pseudocode in Algorithm 4.
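The factoring step can be sketched in Python as below. The sketch assumes that a permutation is stored in one-line notation (a list whose k-th entry is the image of k) and that products are composed right to left. The function names `factor_into_adjacent_swaps` and `compose_adjacent_swaps` are hypothetical and serve only to illustrate the Bubblesort idea.

```python
def factor_into_adjacent_swaps(sigma):
    """Factor a permutation into adjacent transpositions with Bubblesort.

    `sigma` is a list where sigma[k - 1] is the image of k. The result is
    [j_1, ..., j_m], where each j stands for the swap (j - 1, j) and
    sigma = (j_m - 1, j_m) o ... o (j_1 - 1, j_1), composed right to left.
    """
    work = list(sigma)
    n = len(work)
    factors = []
    for i in range(1, n + 1):
        for j in range(n, i, -1):          # j = n, n-1, ..., i+1
            if work[j - 1] < work[j - 2]:  # sigma(j) < sigma(j-1)
                work[j - 1], work[j - 2] = work[j - 2], work[j - 1]
                factors.append(j)
    return factors


def compose_adjacent_swaps(factors, n):
    """Rebuild the permutation, applying factors[0] first and factors[-1] last."""
    sigma = list(range(1, n + 1))
    for j in factors:
        # Left-multiplying by (j-1, j) exchanges the values j-1 and j.
        swap = {j - 1: j, j: j - 1}
        sigma = [swap.get(v, v) for v in sigma]
    return sigma


cycle_125 = [2, 5, 3, 4, 1]  # the cycle (1,2,5) in S_5: 1 -> 2, 2 -> 5, 5 -> 1
print(factor_into_adjacent_swaps(cycle_125))  # [5, 4, 3, 2, 4, 5]
```

Read from last to first, the factors [5, 4, 3, 2, 4, 5] spell out (4,5)(3,4)(1,2)(2,3)(3,4)(4,5), which is the factorization of (1,2,5) given in the text. Multiplying the corresponding adjacent-transposition matrices in the same order would then yield the representation matrix.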


B Decomposing the tensor product representation

We now turn to the Tensor Product Decomposition problem, which is that of finding the irreducible components of the typically reducible tensor product representation. If $\rho_\lambda$ and $\rho_\mu$ are irreducible representations of $S_n$, then there exists an intertwining operator $C_{\lambda\mu}$ such that:

$$C_{\lambda\mu}^{-1}\cdot\left(\rho_\lambda\otimes\rho_\mu(\sigma)\right)\cdot C_{\lambda\mu} = \bigoplus_{\nu}\bigoplus_{\ell=1}^{z_{\lambda\mu\nu}}\rho_\nu(\sigma).$$  (B.1)

Algorithm 4: Pseudocode for computing irreducible representation matrices for arbitrary permutations.

GETRHO
input : $\sigma \in S_n$, $\lambda$
output: $\rho_\lambda(\sigma)$ (a $d_\lambda \times d_\lambda$ matrix)
    // Use Bubblesort to factor $\sigma$ into a product of transpositions
    $k \leftarrow 0$;
    factors $\leftarrow \emptyset$;
    for i = 1, 2, ..., n do
        for j = n, n - 1, ..., i + 1 do
            if $\sigma(j) < \sigma(j-1)$ then
                Swap($\sigma(j-1)$, $\sigma(j)$);
                $k \leftarrow k + 1$;
                factors(k) $\leftarrow j$;
    // Construct representation matrix using adjacent transpositions
    $\rho_\lambda(\sigma) \leftarrow I_{d_\lambda \times d_\lambda}$;
    $m \leftarrow$ length(factors);
    for j = 1, ..., m do
        $\rho_\lambda(\sigma) \leftarrow$ GETADJACENTRHO(factors(j), $\lambda$) $\cdot\ \rho_\lambda(\sigma)$;


In this section, we will present a set of numerical methods for computing the Clebsch-Gordan series ($z_{\lambda\mu\nu}$) and Clebsch-Gordan coefficients ($C_{\lambda\mu}$) for a pair of irreducible representations $\rho_\lambda \otimes \rho_\mu$. We begin by discussing two methods for computing the Clebsch-Gordan series. In the second section, we provide a general algorithm for computing the intertwining operators which relate two equivalent representations and discuss how it can be applied to computing the Clebsch-Gordan coefficients (Equation B.1) and the matrices which relate marginal probabilities to irreducible Fourier coefficients (Equation 5.4).

B.1 Computing the Clebsch-Gordan series

We begin with a simple, well-known algorithm based on group characters for computing the Clebsch-Gordan series that turns out to be computationally intractable, but yields several illuminating theoretical results. See (Serre, 1977) for proofs of the theoretical results cited in this section.

One of the main results of representation theory was the discovery that there exists a relatively compact way of encoding any representation up to equivalence with a vector which we call the character of the representation. If $\rho$ is a representation of a group G, then the character of the representation $\rho$ is defined simply to be the trace of the representation at each element $\sigma \in G$:

$$\chi_\rho(\sigma) = \mathrm{Tr}\left(\rho(\sigma)\right).$$

The reason characters have been so extensively studied is that they uniquely characterize a representation up to equivalence in the sense that two characters $\chi_{\rho_1}$ and $\chi_{\rho_2}$ are equal if and only if $\rho_1$ and $\rho_2$ are equivalent as representations. Even more surprising is that the space of possible group characters is orthogonally spanned by the characters of the irreducible representations. To make this precise, we first define an inner product on functions from G.

Definition 28. Let $\phi, \psi$ be two real-valued functions on G. The inner product of $\phi$ and $\psi$ is defined to be:

$$\langle\phi,\psi\rangle = \frac{1}{|G|}\sum_{\sigma\in G}\phi(\sigma)\,\psi(\sigma).$$

With  respect  to  the  above  inner  product,  we  have  the  following  important 
result  which  allows  us  to  test  a  given  representation  for  irreducibility,  and  to 
test  two  irreducibles  for  equivalence. 

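Proposition 29. Let $\chi_{\rho_1}$ and $\chi_{\rho_2}$ be characters corresponding to irreducible representations. Then

$$\langle\chi_{\rho_1},\chi_{\rho_2}\rangle = \begin{cases} 1 & \text{if } \rho_1 \equiv \rho_2, \\ 0 & \text{otherwise.} \end{cases}$$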

Proposition  29  shows  that  the  irreducible  characters  form  an  orthonormal 
set  of  functions.  The  next  proposition  says  that  the  irreducible  characters  span 
the  space  of  all  possible  characters. 

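Proposition 30. Suppose $\rho$ is any representation of G which decomposes into irreducibles as:

$$\rho \equiv \bigoplus_{\lambda}\bigoplus_{\ell=1}^{z_\lambda}\rho_\lambda,$$

where $\lambda$ indexes over all irreducibles of G. Then:

1. The character of $\rho$ is a linear combination of irreducible characters ($\chi_\rho = \sum_\lambda z_\lambda\chi_{\rho_\lambda}$),

2. and the multiplicity of each irreducible, $z_\lambda$, can be recovered using $\langle\chi_\rho,\chi_{\rho_\lambda}\rangle = z_\lambda$.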

A simple way to decompose any group representation $\rho$ is given by Proposition 30, which says that we can take inner products of $\chi_\rho$ against the basis of irreducible characters to obtain the irreducible multiplicities $z_\lambda$. To treat the special case of finding the Clebsch-Gordan series, one observes that the character of the tensor product is simply the pointwise product of the characters of each tensor product factor.
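To make Propositions 29 and 30 concrete, the sketch below evaluates the inner product of Definition 28 by brute force over $S_3$. The character values are the standard ones for $S_3$, keyed by cycle type. The dictionary layout and the names `cycle_type`, `inner_product`, `IRREDUCIBLES` and `perm_char` are illustrative assumptions. The representation being decomposed is the permutation representation of $S_3$, whose character counts fixed points.

```python
from fractions import Fraction
from itertools import permutations


def cycle_type(perm):
    """Cycle lengths (largest first) of a permutation given as 0-based images."""
    seen, lengths = set(), []
    for start in range(len(perm)):
        length, k = 0, start
        while k not in seen:
            seen.add(k)
            k = perm[k]
            length += 1
        if length:
            lengths.append(length)
    return tuple(sorted(lengths, reverse=True))


# Irreducible characters of S_3, indexed by partition and keyed by cycle type.
IRREDUCIBLES = {
    "(3)":     {(1, 1, 1): 1, (2, 1): 1,  (3,): 1},
    "(2,1)":   {(1, 1, 1): 2, (2, 1): 0,  (3,): -1},
    "(1,1,1)": {(1, 1, 1): 1, (2, 1): -1, (3,): 1},
}
GROUP = list(permutations(range(3)))


def inner_product(phi, psi):
    """<phi, psi> = (1/|G|) * sum over sigma in G of phi(sigma) * psi(sigma)."""
    total = sum(phi[cycle_type(s)] * psi[cycle_type(s)] for s in GROUP)
    return Fraction(total, len(GROUP))


# Character of the permutation representation: the number of fixed points.
perm_char = {(1, 1, 1): 3, (2, 1): 1, (3,): 0}
multiplicities = {
    name: inner_product(perm_char, chi) for name, chi in IRREDUCIBLES.items()
}
```

Here `multiplicities` records one copy each of the irreducibles indexed by (3) and (2,1) and no copy of (1,1,1), while the pairwise inner products of the irreducible characters form the identity matrix.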

Theorem 31. Let $\rho_\lambda$ and $\rho_\mu$ be irreducible representations with characters $\chi_\lambda, \chi_\mu$ respectively. Let $z_{\lambda\mu\nu}$ be the number of copies of $\rho_\nu$ in $\rho_\lambda \otimes \rho_\mu$ (hence, one term of the Clebsch-Gordan series). Then:

1. The character of the tensor product representation is given by:

$$\chi_{\rho_\lambda\otimes\rho_\mu} = \chi_\lambda\cdot\chi_\mu = \sum_{\nu} z_{\lambda\mu\nu}\,\chi_\nu.$$  (B.2)

2. The terms of the Clebsch-Gordan series can be computed using:

$$z_{\lambda\mu\nu} = \frac{1}{|G|}\sum_{g\in G}\chi_\lambda(g)\cdot\chi_\mu(g)\cdot\chi_\nu(g),$$  (B.3)

and satisfy the following symmetry:

$$z_{\lambda\mu\nu} = z_{\lambda\nu\mu} = z_{\mu\lambda\nu} = z_{\mu\nu\lambda} = z_{\nu\lambda\mu} = z_{\nu\mu\lambda}.$$  (B.4)

Dot products for characters on the symmetric group can be done in O(#(n)) time, where #(n) is the number of partitions of the number n, instead of the naive O(n!) time. In practice, however, #(n) also grows too quickly for the character method to be tractable.
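Equation B.3 can be evaluated class by class, since characters are constant on conjugacy classes. The following sketch does this for $S_3$, which has three classes. The class labels, the table layout and the function name `clebsch_gordan_series` are assumptions made for illustration. The character values are the standard ones for $S_3$.

```python
from fractions import Fraction

# Conjugacy classes of S_3 with their sizes (the sizes sum to 3! = 6).
CLASS_SIZES = {"e": 1, "transposition": 3, "3-cycle": 2}
CHARACTERS = {
    "(3)":     {"e": 1, "transposition": 1,  "3-cycle": 1},
    "(2,1)":   {"e": 2, "transposition": 0,  "3-cycle": -1},
    "(1,1,1)": {"e": 1, "transposition": -1, "3-cycle": 1},
}


def clebsch_gordan_series(lam, mu):
    """Return {nu: z_{lam,mu,nu}} using Equation B.3, summed class by class."""
    order = sum(CLASS_SIZES.values())
    series = {}
    for nu, chi_nu in CHARACTERS.items():
        total = sum(
            size * CHARACTERS[lam][c] * CHARACTERS[mu][c] * chi_nu[c]
            for c, size in CLASS_SIZES.items()
        )
        series[nu] = Fraction(total, order)
    return series


series = clebsch_gordan_series("(2,1)", "(2,1)")
```

For the pair (2,1), (2,1) every multiplicity comes out as one, so the tensor product contains each irreducible of $S_3$ exactly once. The sum runs over three classes rather than six group elements, which is the saving described above, although the number of classes still grows quickly with n.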

B.1.1  Murnaghan’s  formulas 

A theorem by Murnaghan (Murnaghan, 1938) gives us a 'bound' on which representations can appear in the tensor product decomposition on $S_n$.

Theorem 32. Let $\rho_1, \rho_2$ be the irreducibles corresponding to the partitions $(n-p, \lambda_2, \dots)$ and $(n-q, \mu_2, \dots)$ respectively. Then the product $\rho_1 \otimes \rho_2$ does not contain any irreducibles corresponding to a partition whose first term is less than n - p - q.

In  view  of  the  connection  between  the  Clebsch-Gordan  series  and  convolution 
of  Fourier  coefficients,  Theorem  32  is  analogous  to  the  fact  that  for  functions 
over  the  reals,  the  convolution  of  two  compactly  supported  functions  is  also 
compactly  supported. 

We  can  use  Theorem  32  to  show  that  Kronecker  conditioning  is  exact  at 
certain  irreducibles. 

Proof of Theorem 21. Let $\Lambda$ denote the set of irreducibles at which our algorithm maintains Fourier coefficients. Since the errors in the prior come from setting coefficients outside of $\Lambda$ to be zero, we see that Kronecker conditioning returns an approximate posterior which is exact at the irreducibles in

$$\Lambda_{\mathrm{exact}} = \{\rho_\nu : z_{\lambda\mu\nu} = 0, \text{ where } \lambda \notin \Lambda \text{ and } \mu \trianglerighteq (n-q, \mu_2, \dots)\}.$$

Combining Theorem 32 with Equation B.4: if $z_{\lambda\mu\nu} > 0$, with $\lambda = (n-p, \lambda_2, \lambda_3, \dots)$, $\mu = (n-q, \mu_2, \mu_3, \dots)$ and $\nu = (n-r, \nu_2, \nu_3, \dots)$, then we have that: $r \leq p + q$, $p \leq q + r$, and $q \leq p + r$. In particular, it implies that $r \geq p - q$ and $r \geq q - p$, or more succinctly, $r \geq |p - q|$. Hence, if $\nu = (n-r, \nu_2, \dots)$, then $\rho_\nu \in \Lambda_{\mathrm{exact}}$ whenever $r < |p - q|$, which proves the desired result. □

The same paper (Murnaghan, 1938) derives several general Clebsch-Gordan series formulas for pairs of low-order irreducibles in terms of n, and in particular, derives the Clebsch-Gordan series for many of the Kronecker product pairs that one would likely encounter in practice. For example,

• $\rho_{(n-1,1)} \otimes \rho_{(n-1,1)} \equiv \rho_{(n)} \oplus \rho_{(n-1,1)} \oplus \rho_{(n-2,2)} \oplus \rho_{(n-2,1,1)}$

• $\rho_{(n-1,1)} \otimes \rho_{(n-2,2)} \equiv \rho_{(n-1,1)} \oplus \rho_{(n-2,2)} \oplus \rho_{(n-2,1,1)} \oplus \rho_{(n-3,3)} \oplus \rho_{(n-3,2,1)}$

• $\rho_{(n-1,1)} \otimes \rho_{(n-2,1,1)} \equiv \rho_{(n-1,1)} \oplus \rho_{(n-2,2)} \oplus \rho_{(n-2,1,1)} \oplus \rho_{(n-3,2,1)} \oplus \rho_{(n-3,1,1,1)}$

• $\rho_{(n-1,1)} \otimes \rho_{(n-3,3)} \equiv \rho_{(n-2,2)} \oplus \rho_{(n-3,3)} \oplus \rho_{(n-3,2,1)} \oplus \rho_{(n-4,4)} \oplus \rho_{(n-4,3,1)}$
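The bounds used in the proof, $|p - q| \leq r \leq p + q$, can be checked against the four formulas listed above. In the sketch below each partition $(n-r, \dots)$ is stored by its tail, meaning the parts after the first, so that r is the sum of the tail. This encoding and the names `MURNAGHAN_EXAMPLES` and `within_bounds` are assumptions made for the illustration.

```python
# Keys are (tail of lambda, tail of mu); values list the tails of every nu that
# appears in the corresponding formula. For example (1,) encodes (n-1, 1) and
# () encodes the trivial partition (n).
MURNAGHAN_EXAMPLES = {
    ((1,), (1,)):   [(), (1,), (2,), (1, 1)],
    ((1,), (2,)):   [(1,), (2,), (1, 1), (3,), (2, 1)],
    ((1,), (1, 1)): [(1,), (2,), (1, 1), (2, 1), (1, 1, 1)],
    ((1,), (3,)):   [(2,), (3,), (2, 1), (4,), (3, 1)],
}


def within_bounds(lam_tail, mu_tail, nu_tail):
    """Check |p - q| <= r <= p + q, where p, q, r are the sums of the tails."""
    p, q, r = sum(lam_tail), sum(mu_tail), sum(nu_tail)
    return abs(p - q) <= r <= p + q


all_ok = all(
    within_bounds(lam, mu, nu)
    for (lam, mu), nus in MURNAGHAN_EXAMPLES.items()
    for nu in nus
)
```

Every listed component satisfies both bounds. In the last formula, for instance, p = 1 and q = 3, and the components have r between 2 and 4.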

B.2  Computing  the  Clebsch-Gordan  coefficients 

In this section, we consider the general problem of finding an orthogonal operator which decomposes an arbitrary representation, $X(\sigma)$, of a finite group G. Unlike the Clebsch-Gordan series, which is basis-independent, intertwining operators must be recomputed if we change the underlying basis by which the irreducible representation matrices are constructed. However, for a fixed basis, we remind the reader that these intertwining operators need only be computed once and for all and can be stored in a table for future reference. Let X be any degree d group representation of G, and let Y be an equivalent direct sum of irreducibles, e.g.,

$$Y(\sigma) \equiv \bigoplus_{\nu}\bigoplus_{\ell=1}^{z_\nu}\rho_\nu(\sigma),$$  (B.5)

where each irreducible $\rho_\nu$ has degree $d_\nu$. We would like to compute an invertible (and orthogonal) operator C, such that $C\cdot X(\sigma) = Y(\sigma)\cdot C$, for all $\sigma \in G$. Throughout this section, we will assume that the multiplicities $z_\nu$ are known. To compute Clebsch-Gordan coefficients, for example, we would set $X = \rho_\lambda \otimes \rho_\mu$ and the multiplicities would be given by the Clebsch-Gordan series (Equation B.1). To find the matrix which relates marginal probabilities to irreducible coefficients, we would set $X = \tau_\lambda$, and the multiplicities would be given by the Kostka numbers (Equation 5.4).

We will begin by describing an algorithm for computing a basis for the space of all possible intertwining operators which we denote by:

$$\mathrm{Int}_{[X;Y]} = \{C \in \mathbb{R}^{d\times d} : C\cdot X(\sigma) = Y(\sigma)\cdot C,\ \forall\sigma \in G\}.$$

We will then discuss some of the theoretical properties of $\mathrm{Int}_{[X;Y]}$ and show how to efficiently select an orthogonal element of $\mathrm{Int}_{[X;Y]}$.

Our approach is to naively⁸ view the task of finding elements of $\mathrm{Int}_{[X;Y]}$ as a similarity matrix recovery problem, with the twist that the similarity matrix must be consistent over all group elements. We first cast the problem of recovering a similarity matrix as a nullspace computation.

⁸ In implementation, we use a more efficient algorithm for computing intertwining operators known as the Eigenfunction Method (EFM) (Chen, 1989). Unfortunately, the EFM is too complicated for us to describe in this paper. The method which we describe in this appendix is conceptually simpler than the EFM and generalizes easily to groups besides $S_n$.

Proposition 33. Let A, B, C be matrices and let $K_{AB} = I \otimes A - B^T \otimes I$. Then AC = CB if and only if $\mathrm{vec}(C) \in \mathrm{Nullspace}(K_{AB})$.

Proof. A well known matrix identity (van Loan, 2000) states that if A, B, C are matrices, then $\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}(B)$. Applying the identity to AC = CB, we have:

$$\mathrm{vec}(ACI) = \mathrm{vec}(ICB),$$

so that $(I \otimes A)\,\mathrm{vec}(C) = (B^T \otimes I)\,\mathrm{vec}(C)$, and after rearranging:

$$(I \otimes A - B^T \otimes I)\,\mathrm{vec}(C) = 0,$$

showing that $\mathrm{vec}(C) \in \mathrm{Nullspace}(K_{AB})$. □
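Proposition 33 can be checked numerically on a small example. The sketch below assumes that vec stacks columns (column-major order), which is the convention under which the identity $\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}(B)$ holds. The helper names (`kron`, `vec`, `k_matrix`, `matvec`) and the particular matrices are hypothetical choices for illustration. A is built as $C B C^{-1}$ so that AC = CB holds by construction.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def vec(c):
    """Stack the columns of c into one long vector (column-major order)."""
    return [c[r][k] for k in range(len(c[0])) for r in range(len(c))]


def identity(n):
    return [[int(r == k) for k in range(n)] for r in range(n)]


def k_matrix(a, b):
    """K_AB = I (x) A - B^T (x) I for square matrices of equal size."""
    n = len(a)
    left, right = kron(identity(n), a), kron(transpose(b), identity(n))
    return [[x - y for x, y in zip(rl, rr)] for rl, rr in zip(left, right)]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


B = [[1, 2], [3, 4]]
C = [[1, 1], [0, 1]]
A = [[4, 2], [3, 1]]  # equals C * B * C^{-1}, so A C = C B

residual = matvec(k_matrix(A, B), vec(C))        # all zeros
not_a_solution = matvec(k_matrix(A, B), vec(identity(2)))
```

The vector `residual` is zero because C satisfies AC = CB, whereas the identity matrix does not (A differs from B), so `not_a_solution` has a nonzero entry.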

For each $\sigma \in G$, the nullspace of the matrix $K(\sigma)$ constructed using the above proposition (with $A = Y(\sigma)$ and $B = X(\sigma)$) as:

$$K(\sigma) = I \otimes Y(\sigma) - X(\sigma)^T \otimes I,$$  (B.6)

where I is a d × d identity matrix, corresponds to the space of matrices $C_\sigma$ such that

$$C_\sigma\cdot X(\sigma) = Y(\sigma)\cdot C_\sigma$$

for that particular $\sigma$. To find the space of intertwining operators which are consistent across all group elements, we need to find the intersection:

$$\bigcap_{\sigma\in G}\mathrm{Nullspace}(K(\sigma)).$$  (B.7)

At first glance, it may seem that computing the intersection might require examining n! nullspaces if $G = S_n$, but as luck would have it, most of the nullspaces in the intersection are extraneous, as we now show.

Definition 34. We say that a finite group G is generated by a set of generators $S = \{g_1, \dots, g_m\}$ if every element of G can be written as a finite product of elements in S.

For example, the following three sets are all generators for $S_n$:

• {(1,2), (1,3), ..., (1,n)},

• {(1,2), (2,3), (3,4), ..., (n-1,n)}, and

• {(1,2), (1,2,3,...,n)}.
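The three generating sets can be checked for a small n by closing each set under composition and counting the elements reached. The sketch below does this for n = 4. Permutations are stored as tuples of 0-based images, and the helper names `compose`, `generated_group` and `transposition` are hypothetical.

```python
from math import factorial


def compose(p, q):
    """(p o q)(k) = p(q(k)) for permutations stored as tuples of 0-based images."""
    return tuple(p[k] for k in q)


def generated_group(generators, n):
    """Close a generating set under composition, starting from the identity."""
    identity = tuple(range(n))
    group, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


def transposition(a, b, n):
    """The transposition (a, b) in 1-based cycle notation, as a 0-based tuple."""
    images = list(range(n))
    images[a - 1], images[b - 1] = b - 1, a - 1
    return tuple(images)


n = 4
star = [transposition(1, k, n) for k in range(2, n + 1)]
adjacent = [transposition(k, k + 1, n) for k in range(1, n)]
n_cycle = tuple(list(range(1, n)) + [0])  # the cycle (1, 2, ..., n)
two_generators = [transposition(1, 2, n), n_cycle]

sizes = [len(generated_group(s, n)) for s in (star, adjacent, two_generators)]
```

Each entry of `sizes` equals `factorial(4)`, confirming that all three sets generate $S_4$.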

To  ensure  a  consistent  similarity  matrix  for  all  group  elements,  we  use  the 
following  proposition  which  says  that  it  suffices  to  be  consistent  on  any  set  of 
generators  of  the  group. 

Proposition 35. Let X and Y be representations of a finite group G and suppose that G is generated by the elements $\sigma_1, \dots, \sigma_m$. If there exists an invertible linear operator C such that $C\cdot X(\sigma_i) = Y(\sigma_i)\cdot C$ for each $i \in \{1, \dots, m\}$, then X and Y are equivalent as representations with C as the intertwining operator.

Proof. We just need to show that C is a similarity transform for any other element of G as well. Let $\pi$ be any element of G and suppose $\pi$ can be written as the following product of generators: $\pi = \prod_{i=1}^{m}\sigma_i$. It follows that:

$$C^{-1}\cdot Y(\pi)\cdot C = C^{-1}\cdot\Big(\prod_{i} Y(\sigma_i)\Big)\cdot C$$
$$= (C^{-1}\cdot Y(\sigma_1)\cdot C)(C^{-1}\cdot Y(\sigma_2)\cdot C)\cdots(C^{-1}\cdot Y(\sigma_m)\cdot C)$$
$$= \prod_{i}\left(C^{-1}\cdot Y(\sigma_i)\cdot C\right) = \prod_{i} X(\sigma_i) = X(\pi).$$

Since this holds for every $\pi \in G$, we have shown C to be an intertwining operator between the representations X and Y. □


The good news is that despite having n! elements, $S_n$ can be generated by just two elements, namely, (1,2) and (1,2,...,n), and so the problem reduces to solving for the intersection of two nullspaces, $\mathrm{Nullspace}(K(1,2)) \cap \mathrm{Nullspace}(K(1,2,\dots,n))$, which can be done using standard numerical methods. Typically, the nullspace is multidimensional, showing that, for example, the Clebsch-Gordan coefficients for $\rho_\lambda \otimes \rho_\mu$ are not unique even up to scale.
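The two-generator reduction can be illustrated on $S_3$ with exact rational arithmetic. In this sketch X is the permutation representation and Y is the trivial representation plus a two-dimensional irreducible written in the basis $e_1 - e_2$, $e_2 - e_3$. That basis is not orthogonal. It is an assumption chosen only so that every entry is an integer. The K matrices use the transpose of X, following Proposition 33 with column-major vec. All helper names are hypothetical.

```python
from fractions import Fraction


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def kron(a, b):
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def eye(n):
    return [[int(r == c) for c in range(n)] for r in range(n)]


def k_matrix(x, y):
    """K = I (x) Y - X^T (x) I, so K vec(C) = 0 exactly when C X = Y C."""
    d = len(x)
    x_t = [list(col) for col in zip(*x)]
    left, right = kron(eye(d), y), kron(x_t, eye(d))
    return [[p - q for p, q in zip(rl, rr)] for rl, rr in zip(left, right)]


def nullspace(rows):
    """Exact nullspace basis via Gauss-Jordan elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols, pivots, r = len(m[0]), [], 0
    for c in range(n_cols):
        pivot = next((k for k in range(r, len(m)) if m[k][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        m[r] = [v / m[r][c] for v in m[r]]
        for k in range(len(m)):
            if k != r and m[k][c] != 0:
                m[k] = [a - m[k][c] * b for a, b in zip(m[k], m[r])]
        pivots.append(c)
        r += 1
        if r == len(m):
            break
    basis = []
    for free in (c for c in range(n_cols) if c not in pivots):
        v = [Fraction(0)] * n_cols
        v[free] = Fraction(1)
        for row, p in zip(m, pivots):
            v[p] = -row[free]
        basis.append(v)
    return basis


def unvec(v, d):
    """Undo column-major vec: entry (r, c) of C sits at position c * d + r."""
    return [[v[c * d + r] for c in range(d)] for r in range(d)]


# The two generators (1,2) and (1,2,3) of S_3, evaluated in X and in Y.
X = {"(1,2)": [[0, 1, 0], [1, 0, 0], [0, 0, 1]],
     "(1,2,3)": [[0, 0, 1], [1, 0, 0], [0, 1, 0]]}
Y = {"(1,2)": [[1, 0, 0], [0, -1, 1], [0, 0, 1]],
     "(1,2,3)": [[1, 0, 0], [0, 0, -1], [0, 1, -1]]}

# Stacking K(1,2) on top of K(1,2,3) intersects the two nullspaces.
stacked = k_matrix(X["(1,2)"], Y["(1,2)"]) + k_matrix(X["(1,2,3)"], Y["(1,2,3)"])
basis = [unvec(v, 3) for v in nullspace(stacked)]
```

The intersection is two-dimensional here, one degree of freedom for each irreducible block, so even in this small case the intertwining operator is not unique up to a single scale factor.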

Because $\mathrm{Int}_{[X;Y]}$ contains singular operators (the zero matrix is a member of $\mathrm{Int}_{[X;Y]}$, for example), not every element of $\mathrm{Int}_{[X;Y]}$ is actually a legitimate intertwining operator, as we require invertibility. In practice, however, since the singular elements correspond to a measure zero subset of $\mathrm{Int}_{[X;Y]}$, one method for reliably selecting an operator from $\mathrm{Int}_{[X;Y]}$ that "works" is to simply select a random element from the nullspace to be C. It may, however, be desirable to have an orthogonal matrix C which works as an intertwining operator. In the following, we discuss an object called the Commutant Algebra which will lead to several insights about the space $\mathrm{Int}_{[X;Y]}$, and in particular, will lead to an algorithm for 'modifying' any invertible intertwining operator C to be an orthogonal matrix.

Definition 36. The Commutant Algebra of a representation Y is defined to be the space of operators which commute with Y⁹:

$$\mathrm{Com}_Y = \{S \in \mathbb{R}^{d\times d} : S\cdot Y(\sigma) = Y(\sigma)\cdot S,\ \forall\sigma \in G\}.$$


The elements of the Commutant Algebra of Y can be shown to always take on a particular constrained form (shown using Schur's Lemma in (Sagan, 2001)). In particular, every element of $\mathrm{Com}_Y$ takes the form

$$S = \bigoplus_{\nu}\left(M_{z_\nu} \otimes I_{d_\nu}\right),$$  (B.8)

where $M_{z_\nu}$ is some $z_\nu \times z_\nu$ matrix of coefficients and $I_{d_\nu}$ is the $d_\nu \times d_\nu$ identity (recall that the $z_\nu$ are the multiplicities from Equation B.5). Moreover, it can be shown that every matrix of this form must necessarily be an element of the Commutant Algebra.

⁹ Notice that the definition of the Commutant Algebra does not involve the representation X.

The link between $\mathrm{Com}_Y$ and our problem is that the space of intertwining operators can be thought of as a 'translate' of the Commutant Algebra.

Lemma 37. There exists a vector space isomorphism between $\mathrm{Int}_{[X;Y]}$ and $\mathrm{Com}_Y$.

Proof. Let R be any invertible element of $\mathrm{Int}_{[X;Y]}$ and define the linear map $f : \mathrm{Com}_Y \to \mathbb{R}^{d\times d}$ by $f : S \mapsto (S\cdot R)$. We will show that the image of f is exactly the space of intertwining operators. Consider any element $\sigma \in G$:

$$(S\cdot R)\cdot X(\sigma)\cdot(S\cdot R)^{-1} = S\cdot R\cdot X(\sigma)\cdot R^{-1}\cdot S^{-1}$$
$$= S\cdot Y(\sigma)\cdot S^{-1} \quad (\text{since } R \in \mathrm{Int}_{[X;Y]}),$$
$$= Y(\sigma) \quad (\text{since } S \in \mathrm{Com}_Y).$$

We have shown that $S\cdot R \in \mathrm{Int}_{[X;Y]}$, and since f is linear and invertible, we have that $\mathrm{Int}_{[X;Y]}$ and $\mathrm{Com}_Y$ are isomorphic as vector spaces. □

Using the lemma, we can see that the dimension of $\mathrm{Int}_{[X;Y]}$ must be the same as the dimension of $\mathrm{Com}_Y$, and therefore we have the following expression for the dimension of $\mathrm{Int}_{[X;Y]}$.

Proposition 38.

$$\dim \mathrm{Int}_{[X;Y]} = \sum_{\nu} z_\nu^2.$$

Proof. To compute the dimension of $\mathrm{Int}_{[X;Y]}$, we need to compute the dimension of $\mathrm{Com}_Y$, which can be accomplished simply by computing the number of free parameters in Equation B.8. Each matrix $M_{z_\nu}$ is free and yields $z_\nu^2$ parameters, and summing across all irreducibles $\nu$ yields the desired dimension. □

To select an orthogonal intertwining operator, we will assume that we are given some invertible $R \in \mathrm{Int}_{[X;Y]}$ which is not necessarily orthogonal (such as a random element of the nullspace of K (Equation B.6)). To find an orthogonal element, we will 'modify' R to be an orthogonal matrix by applying an appropriate rotation, such that $R\cdot R^T = I$. We begin with a simple observation about $R\cdot R^T$.

Lemma 39. If both X and Y are orthogonal representations and R is an invertible member of $\mathrm{Int}_{[X;Y]}$, then the matrix $R\cdot R^T$ is an element of $\mathrm{Com}_Y$.

Proof. Consider a fixed $\sigma \in G$. Since $R \in \mathrm{Int}_{[X;Y]}$, we have that:

$$X(\sigma) = R^{-1}\cdot Y(\sigma)\cdot R.$$

It is also true that:

$$X(\sigma^{-1}) = R^{-1}\cdot Y(\sigma^{-1})\cdot R.$$  (B.9)

Since $X(\sigma)$ and $Y(\sigma)$ are orthogonal matrices by assumption, Equation B.9 becomes:

$$X^T(\sigma) = R^{-1}\cdot Y^T(\sigma)\cdot R.$$

Taking transposes,

$$X(\sigma) = R^T\cdot Y(\sigma)\cdot(R^{-1})^T.$$

We now multiply both sides on the left by R, and on the right by $R^T$,

$$R\cdot X(\sigma)\cdot R^T = R\cdot R^T\cdot Y(\sigma)\cdot(R^{-1})^T\cdot R^T = R\cdot R^T\cdot Y(\sigma).$$

Since $R \in \mathrm{Int}_{[X;Y]}$, the left-hand side equals $Y(\sigma)\cdot R\cdot R^T$, so

$$Y(\sigma)\cdot R\cdot R^T = R\cdot R^T\cdot Y(\sigma),$$

which shows that $R\cdot R^T \in \mathrm{Com}_Y$. □

Algorithm 5: Pseudocode for computing an orthogonal intertwining operator.

INTXY
input : A degree d orthogonal matrix representation X evaluated at permutations (1,2) and (1,...,n), and the multiplicity $z_\nu$ of the irreducible $\rho_\nu$ in X
output: A matrix $C_\nu$ with orthogonal rows such that $C_\nu^T\cdot\left(\bigoplus^{z_\nu}\rho_\nu\right)\cdot C_\nu = X$
    $K_1 \leftarrow I_{d\times d} \otimes \left(\bigoplus^{z_\nu}\rho_\nu(1,2)\right) - X(1,2)^T \otimes I_{d\times d}$;
    $K_2 \leftarrow I_{d\times d} \otimes \left(\bigoplus^{z_\nu}\rho_\nu(1,\dots,n)\right) - X(1,\dots,n)^T \otimes I_{d\times d}$;
    $K \leftarrow [K_1; K_2]$;  // Stack $K_1$ and $K_2$
    $v \leftarrow$ SparseNullspace($K$, $z_\nu^2$);  // Find the $z_\nu^2$-dimensional nullspace
    $R \leftarrow$ Reshape($v$; $z_\nu d_\nu$, $d$);  // Reshape v into a $(z_\nu d_\nu) \times d$ matrix
    $M \leftarrow$ KroneckerFactors($R\cdot R^T$);  // Find M such that $R\cdot R^T = M \otimes I_{d_\nu}$
    $S_\nu \leftarrow$ Eigenvectors($M$);
    $C_\nu \leftarrow (S_\nu \otimes I_{d_\nu})^T\cdot R$;
    NormalizeRows($C_\nu$);

We can now state and prove our orthogonalization procedure, which works by diagonalizing the matrix $R\cdot R^T$. Due to its highly constrained form, the procedure is quite efficient.

Theorem 40. Let X be any orthogonal group representation of G and Y an equivalent orthogonal irreducible decomposition (as in Equation B.5). Then for any invertible element $R \in \mathrm{Int}_{[X;Y]}$, there exists an (efficiently computable) invertible matrix T such that the matrix $T\cdot R$ is an element of $\mathrm{Int}_{[X;Y]}$ and orthogonal.

Proof. Lemma 39 and Equation B.8 together imply that the matrix $R\cdot R^T$ can always be written in the form

$$R\cdot R^T = \bigoplus_{\nu}\left(M_{z_\nu} \otimes I_{d_\nu}\right).$$

Since $R\cdot R^T$ is symmetric, each of the matrices $M_{z_\nu}$ is also symmetric and must therefore possess an orthogonal basis of eigenvectors. Define the matrix $S_{z_\nu}$ to be the matrix whose columns are the (normalized) eigenvectors of $M_{z_\nu}$.

The matrix $S = \bigoplus_\nu\left(S_{z_\nu} \otimes I_{d_\nu}\right)$ has the following two properties:

1. $(S^T\cdot R)(S^T\cdot R)^T$ is a diagonal matrix:

Each column of S is an eigenvector of $R\cdot R^T$ by standard properties of the direct sum and Kronecker product. Since each of the matrices $S_{z_\nu}$ is orthogonal, the matrix S is also orthogonal. We have:

$$(S^T\cdot R)(S^T\cdot R)^T = S^T\cdot R\cdot R^T\cdot S = S^{-1}\cdot R\cdot R^T\cdot S = D,$$

where D is a diagonal matrix of eigenvalues of $R\cdot R^T$.

2. $S^T\cdot R \in \mathrm{Int}_{[X;Y]}$:

By Equation B.8, a matrix is an element of $\mathrm{Com}_Y$ if and only if it takes the form $\bigoplus_\nu\left(S_{z_\nu} \otimes I_{d_\nu}\right)$. Since S can be written in the required form, so can $S^T$. We see that $S^T \in \mathrm{Com}_Y$, and by the proof of Lemma 37, we see that $S^T\cdot R \in \mathrm{Int}_{[X;Y]}$.

Finally, setting $T = D^{-1/2}\cdot S^T$ makes the matrix $T\cdot R$ orthogonal, since $(T\cdot R)(T\cdot R)^T = D^{-1/2}\cdot D\cdot D^{-1/2} = I$ (and does not change the fact that $T\cdot R \in \mathrm{Int}_{[X;Y]}$). □

We see that the complexity of computing T is dominated by the eigenspace decomposition of $M_{z_\nu}$, which is $O(z_\nu^3)$. Pseudocode for computing orthogonal intertwining operators is given in Algorithm 5.
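The orthogonalization step can be illustrated on the smallest nontrivial case. The sketch below covers only the multiplicity-free special case, in which every $M_{z_\nu}$ is 1 × 1, so $R\cdot R^T$ is already diagonal, S is the identity, and T reduces to $D^{-1/2}$. The general case additionally requires an eigendecomposition of each $M_{z_\nu}$, which is not shown. The group is $S_2$, with X the permutation representation and Y the direct sum of the trivial and sign representations. The function name `orthogonalize_multiplicity_free` and the matrix R are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize_multiplicity_free(r):
    """Return T R with T = D^{-1/2}, assuming R R^T = D is already diagonal."""
    rrt = matmul(r, transpose(r))
    n = len(rrt)
    off_diagonal = max(
        (abs(rrt[i][j]) for i in range(n) for j in range(n) if i != j),
        default=0.0,
    )
    if off_diagonal > 1e-12:
        raise ValueError("R R^T is not diagonal; eigenvectors of each M are needed")
    t = [[1.0 / math.sqrt(rrt[i][i]) if i == j else 0.0 for j in range(n)]
         for i in range(n)]
    return matmul(t, r)


# S_2 is generated by the swap (1,2); both representations are orthogonal.
X = [[0.0, 1.0], [1.0, 0.0]]   # permutation representation at (1,2)
Y = [[1.0, 0.0], [0.0, -1.0]]  # trivial (+) sign at (1,2)

R = [[2.0, 2.0], [3.0, -3.0]]  # invertible, satisfies R X = Y R, not orthogonal
C = orthogonalize_multiplicity_free(R)
```

The resulting C has orthonormal rows and still satisfies $C\cdot X = Y\cdot C$, which is what Theorem 40 guarantees in general.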